Experiment steps

1. Import modules and data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif']=['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False  # keep the minus sign readable with this font
# Load the data
data = pd.read_csv('data.csv')
data.head()

   Gender   Age  Height  Weight  family_history_with_overweight  FAVC  FCVC  NCP       CAEC  SMOKE  CH2O  SCC  FAF  TUE        CALC                 MTRANS           NObeyesdad
0  Female  21.0    1.62    64.0                             yes    no   2.0  3.0  Sometimes     no   2.0   no  0.0  1.0          no  Public_Transportation        Normal_Weight
1  Female  21.0    1.52    56.0                             yes    no   3.0  3.0  Sometimes    yes   3.0  yes  3.0  0.0   Sometimes  Public_Transportation        Normal_Weight
2    Male  23.0    1.80    77.0                             yes    no   2.0  3.0  Sometimes     no   2.0   no  2.0  1.0  Frequently  Public_Transportation        Normal_Weight
3    Male  27.0    1.80    87.0                              no    no   3.0  3.0  Sometimes     no   2.0   no  2.0  0.0  Frequently                Walking   Overweight_Level_I
4    Male  22.0    1.78    89.8                              no    no   2.0  1.0  Sometimes     no   2.0   no  0.0  0.0   Sometimes  Public_Transportation  Overweight_Level_II

2. Inspect the data

# Check the size of the data
data.shape

(2111, 17)

The dataset has 2,111 rows and 17 columns.

# Check the data types
data.dtypes

Gender                             object
Age                               float64
Height                            float64
Weight                            float64
family_history_with_overweight     object
FAVC                               object
FCVC                              float64
NCP                               float64
CAEC                               object
SMOKE                              object
CH2O                              float64
SCC                                object
FAF                               float64
TUE                               float64
CALC                               object
MTRANS                             object
NObeyesdad                         object
dtype: object

The columns have two dtypes: object and float64.
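As a small illustration of how the two kinds of columns can be separated before further analysis, the sketch below splits a toy frame with `select_dtypes`. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with both numeric and text columns
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [21.0, 23.0, 27.0],
    "Height": [1.62, 1.80, 1.80],
    "SMOKE": ["no", "no", "yes"],
})

# Numeric columns, and everything that is not numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Selecting by "number" and "not number" avoids depending on how a given pandas version labels text columns.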

# Summary statistics for the numeric columns
data.describe()

               Age       Height       Weight         FCVC          NCP         CH2O          FAF          TUE
count  2111.000000  2111.000000  2111.000000  2111.000000  2111.000000  2111.000000  2111.000000  2111.000000
mean     24.312600     1.701677    86.586058     2.419043     2.685628     2.008011     1.010298     0.657866
std       6.345968     0.093305    26.191172     0.533927     0.778039     0.612953     0.850592     0.608927
min      14.000000     1.450000    39.000000     1.000000     1.000000     1.000000     0.000000     0.000000
25%      19.947192     1.630000    65.473343     2.000000     2.658738     1.584812     0.124505     0.000000
50%      22.777890     1.700499    83.000000     2.385502     3.000000     2.000000     1.000000     0.625350
75%      26.000000     1.768464   107.430682     3.000000     3.000000     2.477420     1.666678     1.000000
max      61.000000     1.980000   173.000000     3.000000     4.000000     3.000000     3.000000     2.000000

This shows the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column.

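# Summary statistics for the non-numeric columns
# (np.object no longer exists in NumPy >= 1.24; the built-in object selects the same columns)
data.describe(include=object)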

        Gender  family_history_with_overweight  FAVC       CAEC  SMOKE   SCC       CALC                 MTRANS      NObeyesdad
count     2111                            2111  2111       2111   2111  2111       2111                   2111            2111
unique       2                               2     2          4      2     2          4                      5               7
top       Male                             yes   yes  Sometimes     no    no  Sometimes  Public_Transportation  Obesity_Type_I
freq      1068                            1726  1866       1765   2067  2015       1401                   1580             351

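This shows, for each non-numeric column, the count, the number of distinct values, the most frequent value, and how many times that value occurs.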

3. Data preprocessing

# Check for missing values
data.isnull().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

The data has no missing values, so nothing needs to be imputed or dropped.

# Check for duplicate rows
any(data.duplicated())

True

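data.duplicated() returns a Boolean Series: the first occurrence of a row is marked False and every later copy is marked True. Passing it to any() therefore returns True as soon as at least one duplicate exists, and False otherwise. Here the result is True, so the data contains duplicate rows that need to be removed.

# Drop duplicate rows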
data.drop_duplicates(inplace=True)
data.shape

(2087, 17)

The original data has 2,111 rows; after removing 24 duplicate rows, 2,087 remain.
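To make the behaviour of duplicated() and drop_duplicates() concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical; on the article's data the sum of the flags would equal 2111 - 2087 = 24.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "Age": [21.0, 23.0, 21.0, 21.0],
    "Weight": [64.0, 77.0, 64.0, 64.0],
})

# Default keep="first": the first occurrence is False, later copies are True
flags = toy.duplicated()
print(flags.tolist())

# Number of rows that drop_duplicates() would remove
n_dupes = int(flags.sum())

# The non-inplace form returns a new frame and leaves `toy` untouched
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, deduped.shape)
```

Counting the flags with sum() reports how many rows are affected, which any() alone does not show.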

4. Visual analysis

Number of people at each obesity level

data['NObeyesdad'].value_counts().plot.barh()

The seven levels are fairly evenly represented, ranging from 267 to 351 people each.

Gender ratio at each obesity level

sex_group = data.groupby(['NObeyesdad','Gender'])['Gender'].count()
sex_group
sex_group.plot(kind='bar')

NObeyesdad           Gender
Insufficient_Weight  Female    169
                     Male       98
Normal_Weight        Female    137
                     Male      145
Obesity_Type_I       Female    156
                     Male      195
Obesity_Type_II      Female      2
                     Male      295
Obesity_Type_III     Female    323
                     Male        1
Overweight_Level_I   Female    145
                     Male      131
Overweight_Level_II  Female    103
                     Male      187
Name: Gender, dtype: int64

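Among the underweight, women clearly outnumber men (169 to 98). Obesity Type II is almost entirely male (295 to 2) and Obesity Type III almost entirely female (323 to 1). In the remaining levels the two genders are closer, although men outnumber women in Overweight Level II (187 to 103).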
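Raw counts are hard to compare across classes of different size, so it can help to convert them to shares. The sketch below rebuilds the grouped counts listed above as a small table and computes each class's share of the data and the share of women within each class. The variable names are illustrative, not part of the original code.

```python
import pandas as pd

# Counts copied from the grouped output above (rows: obesity level)
counts = pd.DataFrame(
    {
        "Female": [169, 137, 156, 2, 323, 145, 103],
        "Male": [98, 145, 195, 295, 1, 131, 187],
    },
    index=[
        "Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
        "Obesity_Type_II", "Obesity_Type_III",
        "Overweight_Level_I", "Overweight_Level_II",
    ],
)

# Class sizes and each class's share of all rows
class_totals = counts.sum(axis=1)
class_share = class_totals / class_totals.sum()

# Share of women within each class (row-wise normalisation)
female_share = counts["Female"] / class_totals

print(class_share.round(3))
print(female_share.round(3))
```

On the raw frame, `pd.crosstab(data['NObeyesdad'], data['Gender'], normalize='index')` should give the same within-class shares without typing the counts by hand.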

Effect of a family history of overweight on obesity level

family_group = data.groupby(['NObeyesdad','family_history_with_overweight'])['family_history_with_overweight'].count()
family_group
family_group.plot.bar()

NObeyesdad           family_history_with_overweight
Insufficient_Weight  no                                142
                     yes                               125
Normal_Weight        no                                130
                     yes                               152
Obesity_Type_I       no                                  7
                     yes                               344
Obesity_Type_II      no                                  1
                     yes                               296
Obesity_Type_III     yes                               324
Overweight_Level_I   no                                 67
                     yes                               209
Overweight_Level_II  no                                 18
                     yes                               272
Name: family_history_with_overweight, dtype: int64

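Almost everyone in Obesity Types I-III and Overweight Level II reports a family history of overweight, as do about three quarters of those in Overweight Level I. This points to a strong association between family history and obesity, which is consistent with obesity running in families.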
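The grouped result above has no "no" row for Obesity_Type_III, which is easy to miss in the long format. As a sketch, the counts can be pivoted into one row per obesity level with `unstack(fill_value=0)`, after which the share with a family history follows directly. The counts are copied from the output above; the variable names `table` and `yes_share` are illustrative.

```python
import pandas as pd

# Counts copied from the grouped output above; note that
# Obesity_Type_III has no "no" row at all
index = pd.MultiIndex.from_tuples(
    [
        ("Insufficient_Weight", "no"), ("Insufficient_Weight", "yes"),
        ("Normal_Weight", "no"), ("Normal_Weight", "yes"),
        ("Obesity_Type_I", "no"), ("Obesity_Type_I", "yes"),
        ("Obesity_Type_II", "no"), ("Obesity_Type_II", "yes"),
        ("Obesity_Type_III", "yes"),
        ("Overweight_Level_I", "no"), ("Overweight_Level_I", "yes"),
        ("Overweight_Level_II", "no"), ("Overweight_Level_II", "yes"),
    ],
    names=["NObeyesdad", "family_history_with_overweight"],
)
family_group = pd.Series(
    [142, 125, 130, 152, 7, 344, 1, 296, 324, 67, 209, 18, 272],
    index=index,
)

# Pivot the inner level into columns; the missing combination becomes 0
table = family_group.unstack(fill_value=0)

# Share of each class that reports a family history of overweight
yes_share = table["yes"] / table.sum(axis=1)
print(yes_share.round(3))
```

Calling `table.plot.bar()` on the pivoted table would draw the "no" and "yes" bars side by side for each level, which may be easier to read than the flat bar chart.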

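Feature correlation analysis

# Correlation
fig = plt.figure(figsize=(18,18))
# Correlate the numeric columns only; the object columns are not encoded yet
sns.heatmap(data.select_dtypes(include='number').corr(),vmax=1)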

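The heatmap makes the pairwise correlations between features easy to scan; the color bar next to the plot shows which colors correspond to stronger correlations.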
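A heatmap gives an overview, but the strongest pairs can also be listed directly. The sketch below does this on synthetic data (not the article's dataset): it keeps the numeric columns, computes Pearson correlations, and ranks each pair once by absolute value. The column names and the relationship between them are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the article's dataset): weight depends on height plus noise
rng = np.random.default_rng(0)
height = rng.normal(1.70, 0.09, size=200)
weight = 60 * height - 20 + rng.normal(0, 5, size=200)
noise = rng.normal(0, 1, size=200)
toy = pd.DataFrame({
    "Height": height,
    "Weight": weight,
    "Noise": noise,
    "Label": ["a", "b"] * 100,  # non-numeric column, excluded below
})

# Keep numeric columns only, then compute Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Keep each pair once (upper triangle, diagonal excluded) and sort by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Selecting the numeric columns first matters because recent pandas versions raise an error when corr() meets text columns.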

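5. Feature engineering

To make modeling easier, we encode NObeyesdad (the obesity level) as 0-6 for insufficient weight, normal weight, overweight level I, overweight level II, obesity type I, obesity type II, and obesity type III; encode CAEC and CALC as 1-4; encode MTRANS as 1-5; and encode family_history_with_overweight, FAVC, SMOKE, SCC, and Gender as 0/1.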

# Encode the NObeyesdad obesity level as 0-6: insufficient weight, normal weight,
# overweight level I, overweight level II, obesity type I, obesity type II, obesity type III.
# The mapped column is assigned back rather than modified in place.
data['NObeyesdad'] = data['NObeyesdad'].map({'Insufficient_Weight':0,
                                             'Normal_Weight':1,
                                             'Overweight_Level_I':2,
                                             'Overweight_Level_II':3,
                                             'Obesity_Type_I':4,
                                             'Obesity_Type_II':5,
                                             'Obesity_Type_III':6})
data['NObeyesdad'].value_counts()
# Encode CAEC and CALC as 1-4
frequency_codes = {'no':1,
                   'Sometimes':2,
                   'Frequently':3,
                   'Always':4}
data['CAEC'] = data['CAEC'].map(frequency_codes)
data['CALC'] = data['CALC'].map(frequency_codes)
# Encode MTRANS as 1-5
data['MTRANS'] = data['MTRANS'].map({'Bike':1,
                                     'Motorbike':2,
                                     'Walking':3,
                                     'Automobile':4,
                                     'Public_Transportation':5})
# Encode family_history_with_overweight, FAVC, SMOKE, SCC and Gender as 0/1
data['family_history_with_overweight'] = data['family_history_with_overweight'].apply(lambda x:0 if x == 'no' else 1)
data['FAVC'] = data['FAVC'].apply(lambda x:0 if x == 'no' else 1)
data['SMOKE'] = data['SMOKE'].apply(lambda x:0 if x == 'no' else 1)
data['SCC'] = data['SCC'].apply(lambda x:0 if x == 'no' else 1)
data['Gender'] = data['Gender'].apply(lambda x:0 if x == 'Female' else 1)

Here is the result after the value transformations:

data.head(10)

   Gender   Age  Height  Weight  family_history_with_overweight  FAVC  FCVC  NCP  CAEC  SMOKE  CH2O  SCC  FAF  TUE  CALC  MTRANS  NObeyesdad
0       0  21.0    1.62    64.0                               1     0   2.0  3.0     2      0   2.0    0  0.0  1.0     1       5           1
1       0  21.0    1.52    56.0                               1     0   3.0  3.0     2      1   3.0    1  3.0  0.0     2       5           1
2       1  23.0    1.80    77.0                               1     0   2.0  3.0     2      0   2.0    0  2.0  1.0     3       5           1
3       1  27.0    1.80    87.0                               0     0   3.0  3.0     2      0   2.0    0  2.0  0.0     3       3           2
4       1  22.0    1.78    89.8                               0     0   2.0  1.0     2      0   2.0    0  0.0  0.0     2       5           3
5       1  29.0    1.62    53.0                               0     1   2.0  3.0     2      0   2.0    0  0.0  0.0     2       4           1
6       0  23.0    1.50    55.0                               1     1   3.0  3.0     2      0   2.0    0  1.0  0.0     2       2           1
7       1  22.0    1.64    53.0                               0     0   2.0  3.0     2      0   2.0    0  3.0  0.0     2       5           1
8       1  24.0    1.78    64.0                               1     1   3.0  3.0     2      0   2.0    0  1.0  1.0     3       5           1
9       1  22.0    1.72    68.0                               1     1   2.0  3.0     2      0   2.0    0  1.0  1.0     1       5           1
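The 1-5 code for MTRANS gives the transport modes an order they do not naturally have. Tree-based models tend to cope with this, but as an alternative (not what the article does) the column could be one-hot encoded. The sketch below shows the idea on a few hypothetical rows; only the category names come from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical) using the transport categories that appear in the dataset
toy = pd.DataFrame({
    "MTRANS": ["Public_Transportation", "Walking", "Automobile",
               "Bike", "Motorbike", "Walking"],
    "Age": [21.0, 27.0, 29.0, 23.0, 22.0, 24.0],
})

# One indicator column per category instead of a single 1-5 code
encoded = pd.get_dummies(toy, columns=["MTRANS"], prefix="MTRANS", dtype=int)
print(encoded.columns.tolist())
```

The trade-off is more columns (five indicators instead of one code), which matters little for 16 features but is worth keeping in mind for high-cardinality columns.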

6. Build the models

First, split the dataset.

# Split into training and test sets
X = data.drop('NObeyesdad',axis=1)
y = data['NObeyesdad']
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
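With seven classes, a plain random split can leave the test set with slightly different class proportions than the training set. One option, sketched below on synthetic stand-in data, is the `stratify` argument of `train_test_split`. The frame `X_toy` and labels `y_toy` are made up for the example and are not the article's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (not the article's data): 7 classes, 30 rows each
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(210, 4)), columns=["f1", "f2", "f3", "f4"])
y_toy = pd.Series(np.repeat(np.arange(7), 30), name="NObeyesdad")

# stratify=y_toy keeps the class proportions equal in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_te.value_counts().sort_index().tolist())
```

Applied to the article's split, this would amount to adding `stratify=y` to the existing call; the reported accuracies would then change slightly because the test rows differ.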

1. Build a decision tree model

# Decision tree
tree = DecisionTreeClassifier()
tree.fit(x_train,y_train)
y_pred = tree.predict(x_test)
print('Model accuracy',accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

Model accuracy 0.9138755980861244
[[54  5  0  0  0  0  0]
 [ 8 42 11  0  0  0  0]
 [ 0  4 49  1  1  0  0]
 [ 0  0  0 47  2  0  0]
 [ 0  0  0  2 67  1  0]
 [ 0  0  0  0  1 63  0]
 [ 0  0  0  0  0  0 60]]

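The decision tree reaches an accuracy of about 0.91; the matrix printed right after the accuracy in the output above is its confusion matrix.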
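The confusion matrix carries more information than the single accuracy figure. As a sketch, the accuracy and the per-class recall and precision can be recomputed from the matrix printed above with NumPy alone (rows are true classes 0-6, columns are predicted classes, following scikit-learn's convention).

```python
import numpy as np

# Decision tree confusion matrix copied from the output above
# (rows: true class 0-6, columns: predicted class 0-6)
cm = np.array([
    [54,  5,  0,  0,  0,  0,  0],
    [ 8, 42, 11,  0,  0,  0,  0],
    [ 0,  4, 49,  1,  1,  0,  0],
    [ 0,  0,  0, 47,  2,  0,  0],
    [ 0,  0,  0,  2, 67,  1,  0],
    [ 0,  0,  0,  0,  1, 63,  0],
    [ 0,  0,  0,  0,  0,  0, 60],
])

# Accuracy is the diagonal (correct predictions) over all test samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the row total
recall = np.diag(cm) / cm.sum(axis=1)
# Precision per class: correct predictions divided by the column total
precision = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))
print(recall.round(3))
print(precision.round(3))
```

Read this way, the weakest class is 1 (Normal_Weight): 19 of its 61 test samples are misclassified, 8 as Insufficient_Weight and 11 as Overweight_Level_I, while the obesity classes are almost always recognised.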

2. Build a random forest model

# Train the model
rfc = RandomForestClassifier(n_estimators=1000)
rfc.fit(x_train,y_train)
y_pred = rfc.predict(x_test)
print('Model accuracy',accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
# Print the feature importance scores, highest first
feat_labels = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
# range(n - 1) stops one short, so the lowest-ranked feature is not printed
for f,j in zip(range(x_train.shape[1]-1),indices):
    print(f + 1, feat_labels[j], importances[j])

Model accuracy 0.9688995215311005
[[56  3  0  0  0  0  0]
 [ 1 58  2  0  0  0  0]
 [ 0  4 50  1  0  0  0]
 [ 0  0  0 49  0  0  0]
 [ 0  0  0  2 68  0  0]
 [ 0  0  0  0  0 64  0]
 [ 0  0  0  0  0  0 60]]
1 Weight 0.3461717299548839
2 Height 0.10306677126361354
3 Age 0.09179444276446319
4 FCVC 0.08913744112847972
5 Gender 0.060092403930844605
6 NCP 0.05001535496608815
7 TUE 0.0453552733033558
8 FAF 0.041620900666372085
9 CH2O 0.040322835978721744
10 family_history_with_overweight 0.031376522711946964
11 CAEC 0.029667089265592847
12 CALC 0.028755767084792445
13 MTRANS 0.01894906014847046
14 FAVC 0.016471000893701973
15 SCC 0.0051458394162270426

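The random forest reaches an accuracy of about 0.969. In the feature importance ranking, weight, height, age, frequency of vegetable consumption (FCVC), gender, and number of main meals (NCP) score highest, which means the model relies on them most when predicting the obesity level.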
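The listing above shows 15 of the 16 features; the 15 scores add up to about 0.998, so the one left out (SMOKE) presumably accounts for the remaining 0.002 or so, since the importances sum to 1. A way to rank every feature without an index loop is to label the scores with a pandas Series, sketched below on synthetic data. The data, the feature names `f0`-`f5`, and the model settings are assumptions for illustration, not the article's setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the article's dataset), 6 features
X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfc_toy = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_toy.fit(X_toy, y_toy)

# Label each importance with its column name and sort; no feature is skipped
ranking = pd.Series(rfc_toy.feature_importances_, index=X_toy.columns)
ranking = ranking.sort_values(ascending=False)

for rank, (name, score) in enumerate(ranking.items(), start=1):
    print(rank, name, round(score, 4))
```

Because the Series keeps names and scores together, `ranking.plot.barh()` would also give a quick chart of the same ranking.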

3. Build a GBDT model

from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier()
gbdt.fit(x_train,y_train)
y_pred = gbdt.predict(x_test)  # predict with the fitted gbdt model
print('Model accuracy',accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

Model accuracy 0.9617224880382775
[[54  5  0  0  0  0  0]
 [ 3 53  5  0  0  0  0]
 [ 0  1 54  0  0  0  0]
 [ 0  0  0 49  0  0  0]
 [ 0  0  0  1 69  0  0]
 [ 0  0  0  0  1 63  0]
 [ 0  0  0  0  0  0 60]]

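The GBDT model reaches an accuracy of about 0.96, which is also high.

Of the three classifiers, the random forest has the highest accuracy on this test split, so we suggest using it to predict obesity levels and to explore the factors behind obesity.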
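On the 418 test samples, the random forest and GBDT differ by only three correct predictions (405 versus 402 on the diagonals of their confusion matrices), so a single split may not separate them reliably. One way to check would be k-fold cross-validation, sketched below on synthetic stand-in data. The data and the reduced `n_estimators` values are assumptions chosen to keep the example fast; they are not the article's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the article's dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gbdt": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy: mean and spread for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the ranking between them should be treated as tentative.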

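Summary

Obesity is a global disease. Regardless of people's social or cultural background it remains a much-discussed topic, and the number of people affected worldwide grows every year. To help fight it, tools and solutions that detect or predict its onset are very important, and data mining is a key tool for uncovering such information.

This article applies the random forest algorithm to the dataset. By running a multi-class classification over the candidate factors, it obtains a weight (feature importance) for each factor with respect to obesity level and builds an obesity assessment model whose accuracy reaches about 96.9% on the test split, which is then used to explore the causes of obesity. The results suggest how the various factors relate to obesity level. A family history of obesity is strongly positively correlated with obesity level, and age and frequent consumption of high-calorie food also show fairly strong positive correlations. In other words, people with a family history of obesity are more likely to be obese, and obesity becomes more likely with age and with frequent high-calorie eating. Monitoring calorie consumption and frequent physical activity are negatively correlated with obesity level, that is, regular calorie monitoring and frequent physical activity are associated with a lower likelihood of obesity. Frequency of alcohol consumption, time spent on technological devices, and daily water intake also have some influence.

Based on these results, controlling obesity calls for strengthening healthy habits that families can adopt, such as balancing meals through the day, keeping regular meal times, eating fewer high-calorie foods, and drinking alcohol less often. Beyond dietary changes, it is essential to increase daily physical activity, for example walking at least half an hour a day, and to drink at least two liters of water a day, because diet alone does not work without exercise. Monitoring calorie consumption regularly and reducing time spent on technological devices also help. High obesity rates among both children and adults drive the high overall obesity rate, and we can no longer ignore this: prevention and control should start early in life so that the problem can be addressed sustainably, and each of us should become more aware and build healthy habits.

With the rapid development of cloud computing, the Internet of Things, and the mobile internet, the variety and volume of data are growing at an unprecedented rate, while advances in artificial intelligence and data mining have made data management more efficient. Working through this real-world case gave us a first understanding of data mining and laid a foundation for further study of data mining and analysis.

Because our understanding of data mining is still limited, we ran into many problems during the experiment, and shortcomings remain: the results may deviate from reality and the model accuracy can still be improved; parts of the code are incomplete and contain flaws; and the analysis of the results is not deep enough and deserves further exploration. We will keep addressing these shortcomings as we continue to learn.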
