Notes from the Developer Academy course "Essential Foundations of Artificial Intelligence: Probability Theory and Mathematical Statistics — Case Study: Preprocessing". These notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/545/detail/7441
Case Study: Preprocessing
I. Preprocessing
If one feature has a much larger variance than the others, it may dominate the objective function and keep the estimator from learning from the other features as expected. This is why we need to scale the data first.
Among the indicator values, some are much larger than others and some are much smaller.
Standardize the continuous-valued columns (first pull out the continuous features, then apply StandardScaler, scikit-learn's standardization tool, to them):

In [55]: # target and features
         target = data.price
         regressors = [x for x in data.columns if x not in ['price']]
         features = data.loc[:, regressors]
         num = ['symboling', 'normalized-losses', 'volume', 'horsepower', 'wheel-base',
                'bore', 'stroke', 'compression-ratio', 'peak-rpm']
         # scale the data
         standard_scaler = StandardScaler()
         features[num] = standard_scaler.fit_transform(features[num])
         # glimpse
         features.head()
Out[55]:
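Since the `data` DataFrame and the earlier import cells are not shown in this excerpt, here is a minimal, self-contained sketch of the same standardization step; the toy column names and values below are made up for illustration only.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy frame with two continuous columns on very different scales (made-up values)
toy = pd.DataFrame({'horsepower': [111, 154, 102, 115],
                    'peak-rpm':   [5000, 5000, 5500, 5500]})

scaler = StandardScaler()
# fit_transform subtracts each column's mean and divides by its standard deviation,
# so every scaled column ends up with mean 0 and unit variance
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled.mean().round(6))   # approximately 0 for each column
print(scaled.std(ddof=0))       # 1 for each column (population standard deviation)

After this transform, no single column can dominate the objective purely because of its scale.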
Perform one-hot encoding on the categorical attributes:
In [57]: # categorical vars
         classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
                    'drive-wheels', 'engine-location', 'engine-type', 'num-of-cylinders',
                    'fuel-system']
         # create new dataset with only continuous vars
         dummies = pd.get_dummies(features[classes])
         features = features.join(dummies).drop(classes, axis=1)
         # new dataset
         print('In total:', features.shape)
         features.head()
In total: (193, 66)
Out[57]:
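As a quick, self-contained illustration of what pd.get_dummies does to one of these columns (the values below are made up, not taken from the course dataset):

import pandas as pd

# a tiny categorical column (illustrative values only)
df = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas']})

# get_dummies expands each category into its own 0/1 indicator column;
# dtype=int keeps the output as integers instead of booleans
dummies = pd.get_dummies(df[['fuel-type']], dtype=int)
print(dummies)
#    fuel-type_diesel  fuel-type_gas
# 0                 0              1
# 1                 1              0
# 2                 0              1

Joining these indicator columns back and dropping the original text columns is exactly what the cell above does for all ten categorical attributes.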
Split the dataset:

In [58]: # split the data into train/test set
         X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                             test_size=0.3,
                                                             random_state=seed)
         print("Train", X_train.shape, "and test", X_test.shape)
Train (135, 66) and test (58, 66)
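train_test_split lives in sklearn.model_selection; here is a minimal sketch on made-up data (the 30% test fraction mirrors the call above, while the seed value 42 and the toy arrays are assumptions for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10)                  # toy targets

# hold out 30% of the rows for testing; a fixed random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)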
Lasso regression
Lasso adds an absolute-value (L1) penalty term to punish overly large coefficients; when alpha = 0 it reduces to ordinary least squares.

In [1078]: # logarithmic scale: log base 2
           # high values zero out more variables
           alphas = 2. ** np.arange(2, 12)
           scores = np.empty_like(alphas)
           for i, a in enumerate(alphas):
               lasso = Lasso(random_state=seed)
               lasso.set_params(alpha=a)
               lasso.fit(X_train, y_train)
               scores[i] = lasso.score(X_test, y_test)

           lassocv = LassoCV(cv=10, random_state=seed)
           lassocv.fit(features, target)
           lassocv_score = lassocv.score(features, target)
           lassocv_alpha = lassocv.alpha_
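In scikit-learn's formulation, Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1, so larger alpha values push more coefficients exactly to zero. Below is a self-contained sketch of the same alpha sweep plus LassoCV on synthetic data; the generated dataset and the seed are assumptions, and only the pattern of the calls mirrors the cell above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

seed = 1
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# sweep alpha over a log-base-2 grid; larger alpha zeroes out more coefficients
alphas = 2.0 ** np.arange(2, 12)
scores = np.empty_like(alphas)
for i, a in enumerate(alphas):
    lasso = Lasso(alpha=a, random_state=seed)
    lasso.fit(X_train, y_train)
    scores[i] = lasso.score(X_test, y_test)   # R^2 on the held-out set

# LassoCV chooses alpha by 10-fold cross-validation, here fit on the full data
# to mirror the cell above
lassocv = LassoCV(cv=10, random_state=seed).fit(X, y)
print('chosen alpha:', lassocv.alpha_, 'R^2:', lassocv.score(X, y))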


