Notes from the Developer Academy course "Essential Foundations of Artificial Intelligence: Probability Theory and Mathematical Statistics — Case Study: Preprocessing". These notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/545/detail/7441
Case Study: Preprocessing
I. Preprocessing
If one feature has a much larger variance than the others, it may dominate the objective function and keep the estimator from learning from the other features as expected. This is why we need to scale the data first.
Among the indicator values, some are much larger than others and some are much smaller.
Standardize the continuous-valued columns (first pull out the continuous features, then apply StandardScaler, scikit-learn's standardization tool, to them):

In [55]: # target and features
         target = data.price
         regressors = [x for x in data.columns if x not in ['price']]
         features = data.loc[:, regressors]
         num = ['symboling', 'normalized-losses', 'volume', 'horsepower', 'wheel-base',
                'bore', 'stroke', 'compression-ratio', 'peak-rpm']
         # scale the data
         standard_scaler = StandardScaler()
         features[num] = standard_scaler.fit_transform(features[num])
         # glimpse
         features.head()
Out[55]:
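Since the `data` DataFrame and the earlier import cells are not shown in this excerpt, here is a minimal, self-contained sketch of the same standardization step; the toy column names and values below are made up for illustration only.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy frame with two continuous columns on very different scales (made-up values)
toy = pd.DataFrame({'horsepower': [111, 154, 102, 115],
                    'peak-rpm':   [5000, 5000, 5500, 5500]})

scaler = StandardScaler()
# fit_transform subtracts each column's mean and divides by its standard deviation,
# so every scaled column ends up with mean 0 and unit variance
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled.mean().round(6))   # approximately 0 for each column
print(scaled.std(ddof=0))       # 1 for each column (population standard deviation)

After this transform, no single column can dominate the objective purely because of its scale.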
Perform one-hot encoding on the categorical attributes:
In [57]: # categorical vars
         classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
                    'drive-wheels', 'engine-location', 'engine-type', 'num-of-cylinders',
                    'fuel-system']
         # create new dataset with only continuous vars
         dummies = pd.get_dummies(features[classes])
         features = features.join(dummies).drop(classes, axis=1)
         # new dataset
         print('In total:', features.shape)
         features.head()
In total: (193, 66)
Out[57]:
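As a quick, self-contained illustration of what pd.get_dummies does to one of these columns (the values below are made up, not taken from the course dataset):

import pandas as pd

# a tiny categorical column (illustrative values only)
df = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas']})

# get_dummies expands each category into its own 0/1 indicator column;
# dtype=int keeps the output as integers instead of booleans
dummies = pd.get_dummies(df[['fuel-type']], dtype=int)
print(dummies)
#    fuel-type_diesel  fuel-type_gas
# 0                 0              1
# 1                 1              0
# 2                 0              1

Joining these indicator columns back and dropping the original text columns is exactly what the cell above does for all ten categorical attributes.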
Split the dataset:

In [58]: # split the data into train/test set
         X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                             test_size=0.3,
                                                             random_state=seed)
         print("Train", X_train.shape, "and test", X_test.shape)
Train (135, 66) and test (58, 66)
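train_test_split lives in sklearn.model_selection; here is a minimal sketch on made-up data (the 30% test fraction mirrors the call above, while the seed value 42 and the toy arrays are assumptions for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10)                  # toy targets

# hold out 30% of the rows for testing; a fixed random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)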
Lasso regression
Lasso adds an absolute-value (L1) penalty term to punish overly large coefficients; when alpha = 0 it reduces to ordinary least squares.

In [1078]: # logarithmic scale: log base 2
           # high values zero out more variables
           alphas = 2. ** np.arange(2, 12)
           scores = np.empty_like(alphas)
           for i, a in enumerate(alphas):
               lasso = Lasso(random_state=seed)
               lasso.set_params(alpha=a)
               lasso.fit(X_train, y_train)
               scores[i] = lasso.score(X_test, y_test)

           lassocv = LassoCV(cv=10, random_state=seed)
           lassocv.fit(features, target)
           lassocv_score = lassocv.score(features, target)
           lassocv_alpha = lassocv.alpha_
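In scikit-learn's formulation, Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1, so larger alpha values push more coefficients exactly to zero. Below is a self-contained sketch of the same alpha sweep plus LassoCV on synthetic data; the generated dataset and the seed are assumptions, and only the pattern of the calls mirrors the cell above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

seed = 1
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# sweep alpha over a log-base-2 grid; larger alpha zeroes out more coefficients
alphas = 2.0 ** np.arange(2, 12)
scores = np.empty_like(alphas)
for i, a in enumerate(alphas):
    lasso = Lasso(alpha=a, random_state=seed)
    lasso.fit(X_train, y_train)
    scores[i] = lasso.score(X_test, y_test)   # R^2 on the held-out set

# LassoCV chooses alpha by 10-fold cross-validation, here fit on the full data
# to mirror the cell above
lassocv = LassoCV(cv=10, random_state=seed).fit(X, y)
print('chosen alpha:', lassocv.alpha_, 'R^2:', lassocv.score(X, y))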


