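Study notes for the Developer Academy course "Essential Foundations for AI: Probability and Mathematical Statistics: Python Hypothesis Testing Example". The notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/545/detail/7452
Python Hypothesis Testing Example
Contents
I. Downloading the dataset
II. Importing the data
III. Plotting the distribution
IV. Normality test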
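I. Downloading the dataset
Dataset download page: https://ww2.amstat.org/publications/jse/jse_data_archive.htm
Dataset file: http://ww2.amstat.org/publications/jse/datasets/normtemp.txt
The dataset contains 130 records. We mainly use the body temperature and gender columns in this experiment.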
II. Importing the data
import math
import warnings
import numpy as np
import pandas as pd
import pylab
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
%matplotlib inline
warnings.filterwarnings("ignore")
# normtemp.txt is whitespace-separated and has no header row, so the column names are supplied here
df = pd.read_csv('normtemp.txt', sep=r'\s+', names=['Temperature', 'Gender', 'Heart Rate'])
df.describe()
df.head()
(The imported data)
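As a quick check of the loading step, the sketch below parses a few whitespace-separated rows the same way. The rows are made up for illustration and are not copied from normtemp.txt; the gender codes 1 and 2 follow the filtering used later in these notes.

```python
import io

import pandas as pd

# Illustrative rows only (assumed values): temperature, gender code, heart rate
sample_text = """\
98.1   1   72
98.6   2   80
97.9   1   69
98.8   2   77
"""

sample_df = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",        # any run of whitespace separates the columns
    header=None,       # the file has no header row
    names=["Temperature", "Gender", "Heart Rate"],
)

# Number of rows per gender code
gender_counts = sample_df["Gender"].value_counts().sort_index()
print(sample_df.shape)
print(gender_counts.to_dict())
```

Checking the shape and the per-gender counts right after loading is a cheap way to catch a wrong separator before running any test.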
III. Plotting the distribution
observed_temperatures = df['Temperature'].sort_values()  # select the Temperature column and sort it
bin_val = np.arange(start=observed_temperatures.min(), stop=observed_temperatures.max(), step=.05)
mu, std = np.mean(observed_temperatures), np.std(observed_temperatures)  # mean and standard deviation
p = norm.pdf(observed_temperatures, mu, std)  # normal density evaluated at each observation
# density=True replaces the normed argument, which newer Matplotlib versions no longer accept
plt.hist(observed_temperatures, bins=bin_val, density=True, stacked=True)
plt.plot(observed_temperatures, p, color='red')
plt.xticks(np.arange(95.75, 101.25, 0.25), rotation=90)
plt.title('Human Body Temperature Distributions')
plt.xlabel('human body temperature')
plt.show()
print('Average (Mu): ' + str(mu) + ' / ' + 'Standard Deviation: ' + str(std))
(The resulting histogram with the fitted normal curve)
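The red curve can be compared with the histogram only because the histogram is normalized to a density. The sketch below illustrates this with NumPy alone. It uses synthetic data as a stand-in for the Temperature column (the mean 98.25 and spread 0.73 are assumed values chosen for illustration) and checks that the bar areas sum to 1.

```python
import numpy as np

# Synthetic stand-in for the Temperature column (assumed values, not the real data)
rng = np.random.default_rng(0)
temps = np.sort(rng.normal(loc=98.25, scale=0.73, size=130))

# Same binning rule as the notes: 0.05-wide bins from the minimum towards the maximum
edges = np.arange(temps.min(), temps.max(), 0.05)
density, edges = np.histogram(temps, bins=edges, density=True)

# With density=True the bar areas (height * width) sum to 1
area = float(np.sum(density * np.diff(edges)))

# Normal density with the sample mean and standard deviation, evaluated at the bin centers
mu, std = temps.mean(), temps.std()
centers = (edges[:-1] + edges[1:]) / 2
pdf = np.exp(-0.5 * ((centers - mu) / std) ** 2) / (std * np.sqrt(2 * np.pi))

print(round(area, 6))
```

Note that np.arange excludes the stop value, so observations above the last bin edge fall outside the histogram. The normalization applies to the points that were actually counted.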
IV. Normality test
x = observed_temperatures
# Shapiro-Wilk Test: https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test
shapiro_test, shapiro_p = scipy.stats.shapiro(x)
print("Shapiro-Wilk Stat:", shapiro_test, "Shapiro-Wilk p-Value:", shapiro_p)
k2, p = scipy.stats.normaltest(observed_temperatures)
print('p:', p)
# Another method for determining normality is a Quantile-Quantile (Q-Q) plot
scipy.stats.probplot(observed_temperatures, dist="norm", plot=pylab)
pylab.show()
p-values from the two tests:
Shapiro-Wilk Stat: 0.9865769743919373 Shapiro-Wilk p-Value: 0.2331680953502655
p: 0.258747986349
In the Q-Q plot, the blue points lie close to the red line.
Both p-values are above 0.05, so neither test rejects normality, and the Q-Q plot is close to a straight line. All three checks are consistent with the imported data following a normal distribution.
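The reasoning behind this conclusion is the usual decision rule: the null hypothesis of both tests is that the data are normal, so a large p-value means normality is not rejected. A minimal sketch follows, assuming a significance level of 0.05; the helper name normality_decision is hypothetical, and the p-values are the ones reported in these notes.

```python
def normality_decision(p_value, alpha=0.05):
    """Return a verdict for a normality test at significance level alpha (assumed 0.05)."""
    if p_value < alpha:
        return "reject normality"
    return "fail to reject normality"


# p-values reported in the notes for the two tests
reported = {
    "Shapiro-Wilk": 0.2331680953502655,
    "D'Agostino-Pearson (scipy.stats.normaltest)": 0.258747986349,
}

verdicts = {name: normality_decision(p) for name, p in reported.items()}
for name, verdict in verdicts.items():
    print(name, "->", verdict)
```

Failing to reject is weaker than proving normality: it only says the sample gives no strong evidence against it.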
Plotting the ECDF
def ecdf(data):
    # Compute the ECDF: sorted values x and cumulative proportions y
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Compute empirical mean and standard deviation
# Number of samples
n = len(df['Temperature'])
# Sample mean
mu = np.mean(df['Temperature'])
# Sample standard deviation (np.std uses ddof=0 by default)
std = np.std(df['Temperature'])
print('Mean temperature: ', mu, 'with standard deviation of +/-', std)
# Draw a random sample from a normal distribution with the same mean and standard deviation as the data
normalized_sample = np.random.normal(mu, std, size=10000)
x_temperature, y_temperature = ecdf(df['Temperature'])
normalized_x, normalized_y = ecdf(normalized_sample)
The yellow points and the blue line largely coincide, which also supports the conclusion that the imported data follow a normal distribution.
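The visual comparison can also be summarized as a number. The sketch below is a swapped-in illustration rather than the plotting step from the course: it uses only the standard library, a small hypothetical sample, and the exact normal CDF from statistics.NormalDist instead of a simulated sample, and it reports the largest gap between the ECDF and the fitted normal CDF at the observed points.

```python
from statistics import NormalDist, mean, pstdev


def ecdf(data):
    """Return sorted values and the cumulative proportion at each value."""
    n = len(data)
    x = sorted(data)
    y = [(i + 1) / n for i in range(n)]
    return x, y


# Hypothetical temperatures for illustration (not the real dataset)
temps = [97.8, 98.0, 98.2, 98.4, 98.6, 98.1, 98.3, 97.9]

# pstdev divides by n, matching the np.std default (ddof=0) used in the notes
mu, sigma = mean(temps), pstdev(temps)
model = NormalDist(mu, sigma)

x, y = ecdf(temps)
# Largest vertical gap between the ECDF and the fitted normal CDF at the data points
max_gap = max(abs(yi - model.cdf(xi)) for xi, yi in zip(x, y))
print(round(max_gap, 3))
```

A small gap plays the same role as the points hugging the line in the plot. This is a rough diagnostic, not a formal Kolmogorov-Smirnov test.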
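Hypothesis testing
1. Some researchers have proposed that 98.6 (degrees Fahrenheit) is the mean human body temperature. Should we accept this?
Here we use a t-test, because the population standard deviation is unknown and we can only compute the sample standard deviation.
from scipy import stats
CW_mu = 98.6
stats.ttest_1samp(df['Temperature'], CW_mu, axis=0)
Ttest_1sampResult(statistic=-5.4548232923640771, pvalue=2.4106320415610081e-07)
Result of the t-test:
The t statistic is -5.454 and the p-value is close to 0 (about 2.4e-07), so we reject the hypothesis that the mean body temperature is 98.6.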
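To make the statistic concrete, the one-sample t statistic is t = (sample mean - mu0) / (s / sqrt(n)), where s is the sample standard deviation and n is the sample size. The sketch below computes it by hand with the standard library; the helper name one_sample_t and the five temperatures are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(data)
    standard_error = stdev(data) / math.sqrt(n)  # stdev divides by n - 1
    return (mean(data) - mu0) / standard_error, n - 1


# Hypothetical sample (not the real dataset)
sample = [98.0, 98.4, 98.2, 98.6, 97.8]
t_stat, dof = one_sample_t(sample, 98.6)
print(round(t_stat, 4), dof)
```

A negative statistic means the sample mean lies below the hypothesized value, which is the direction seen in the result above.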
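2. Is there a significant difference between male and female body temperature?
Two-sample independent t-test. H0: there is no significant difference. H1: there is a significant difference.
female_temp = df.Temperature[df.Gender == 2]
male_temp = df.Temperature[df.Gender == 1]
mean_female_temp = np.mean(female_temp)
mean_male_temp = np.mean(male_temp)
print('Average female body temperature = ' + str(mean_female_temp))
print('Average male body temperature = ' + str(mean_male_temp))
# Compute independent t-test
stats.ttest_ind(female_temp, male_temp, axis=0)  # pass in the two samples
Average female body temperature = 98.39384615384616
Average male body temperature = 98.1046153846154
Ttest_indResult(statistic=2.2854345381654984, pvalue=0.02393188312240236)
Since the p-value = 0.024 < 0.05, we reject the null hypothesis: at the 5% significance level (95% confidence), the difference between male and female body temperature is statistically significant.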
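By default, scipy.stats.ttest_ind assumes equal variances in the two groups and uses a pooled variance estimate. The sketch below reproduces that pooled statistic by hand with the standard library; the helper name pooled_t and both samples are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, dof


# Hypothetical samples (not the real dataset)
group_a = [98.4, 98.6, 98.2, 98.8]
group_b = [98.0, 98.2, 97.8, 98.4]
t_stat, dof = pooled_t(group_a, group_b)
print(round(t_stat, 4), dof)
```

If the two groups may have different variances, passing equal_var=False to ttest_ind runs Welch's t-test instead.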




