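Study notes for the Developer Academy course "Essential Foundations for AI: Probability Theory and Mathematical Statistics: Python Chi-Square Test Example". The notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/545/detail/7453
Python Chi-Square Test Example
Contents
I. Data import
II. Chi-square test
III. Hypothesis test
IV. Expected counts
Research question: do white and Black applicants face racial discrimination when applying for jobs?
I. Data import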
import pandas as pd
import numpy as np
from scipy import stats
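data = pd.read_stata('us_job_market_discrimination.dta')
data.head()
blacks = data[data.race == 'b']  # 'b' marks Black applicants
whites = data[data.race == 'w']  # 'w' marks white applicants
blacks.call.describe()  # call is the outcome: 1 means the applicant was called back (treated in these notes as getting the job), 0 means not
Output: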
count 2435.000000
mean 0.064476
std 0.245649
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: call, dtype: float64
Input:
whites.call.describe()
Output:
count 2435.000000
mean 0.096509
std 0.295346
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: call, dtype: float64
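Because call only takes the values 0 and 1, its mean is the share of callbacks, so the number of callbacks in each group can be recovered as count × mean. A minimal sketch, with the figures copied by hand from the two describe() outputs above (the variable names are illustrative and not taken from the course):

```python
# Figures copied from the describe() outputs above
n_per_group = 2435
black_mean = 0.064476
white_mean = 0.096509

# For a 0/1 column, count * mean is the number of ones (rounded,
# because the printed means are truncated to six decimals)
blacks_called = round(n_per_group * black_mean)
whites_called = round(n_per_group * white_mean)

print(blacks_called, n_per_group - blacks_called)  # prints: 157 2278
print(whites_called, n_per_group - whites_called)  # prints: 235 2200
```

The gap between the two groups (about 6.4% versus 9.7% callbacks) is what the chi-square test then checks for significance.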
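II. Chi-square test
The observations fall into four categories:
White applicants who got the job
White applicants who were rejected
Black applicants who got the job
Black applicants who were rejected
III. Hypothesis test
H0: race has no significant effect on the outcome of a job application
H1: race has an effect on the outcome of a job application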
blacks_called = len(blacks[blacks['call'] == True])  # True (1): got the job
blacks_not_called = len(blacks[blacks['call'] == False])  # False (0): did not get the job
whites_called = len(whites[whites['call'] == True])
whites_not_called = len(whites[whites['call'] == False])
observed = pd.DataFrame({'blacks': {'called': blacks_called, 'not_called': blacks_not_called},
                         'whites': {'called': whites_called, 'not_called': whites_not_called}})
observed
Result (counts consistent with count × mean in the describe() outputs above):
            blacks  whites
called         157     235
not_called    2278    2200
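The same contingency table can also be built in one call with pandas.crosstab instead of four len(...) filters. The sketch below is only an illustration: it does not read the course's .dta file but rebuilds a hypothetical stand-in frame (toy) from the counts implied by the summary statistics (157 and 235 callbacks out of 2435 applicants per group).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data, rebuilt from the counts
# implied by describe(): 2435 rows per race, 157 / 235 callbacks
toy = pd.DataFrame({
    "race": np.repeat(["b", "w"], 2435),
    "call": np.concatenate([
        np.repeat([1.0, 0.0], [157, 2278]),  # Black applicants
        np.repeat([1.0, 0.0], [235, 2200]),  # white applicants
    ]),
})

# Rows are the call outcome (0.0 / 1.0), columns are the race codes
table = pd.crosstab(toy["call"], toy["race"])
print(table)
```

With the real data, the same call would be pd.crosstab(data["call"], data["race"]), assuming the column names used in these notes.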
IV. Expected counts
num_called_back = blacks_called + whites_called  # number of applicants who got the job
num_not_called = blacks_not_called + whites_not_called  # number of applicants who did not
print(num_called_back)
print(num_not_called)
392
4478
rate_of_callbacks = num_called_back / len(data)  # callbacks divided by all applicants, not by the non-callbacks
This gives the expected callback rate under H0.
Input: rate_of_callbacks
Output: 0.08049281314168377
Input: expected_called = len(data) * rate_of_callbacks
expected_not_called = len(data) * (1 - rate_of_callbacks)
print(expected_called)
print(expected_not_called)
Output: 391.99999999999994
4478.0
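The expected totals above still have to be split across the two groups. Under H0 each cell's expected count is (row total × column total) / grand total, which can be sketched with NumPy as follows. The observed counts are typed in by hand from the figures implied by the summary statistics, and the variable names are illustrative assumptions.

```python
import numpy as np

# Observed counts: rows = called / not called, columns = blacks / whites
observed = np.array([[157, 235],
                     [2278, 2200]])

row_totals = observed.sum(axis=1)  # [392, 4478]
col_totals = observed.sum(axis=0)  # [2435, 2435]

# Expected count per cell under independence of race and outcome
expected = np.outer(row_totals, col_totals) / observed.sum()
print(expected)
```

Because both groups contain 2435 applicants, each expected cell is simply half of the corresponding total: 196 callbacks and 2239 rejections per group.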
Input: import scipy.stats as stats
observed_frequencies = [blacks_not_called, whites_not_called, whites_called, blacks_called]
# Both groups have 2435 applicants, so each expected total is split in half
expected_frequencies = [expected_not_called / 2, expected_not_called / 2, expected_called / 2, expected_called / 2]
stats.chisquare(f_obs=observed_frequencies,
                f_exp=expected_frequencies)  # chi-square test
Output: Power_divergenceResult(statistic=16.879050414270221, pvalue=0.00074839594410972638)
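The reported statistic can be checked by hand. Taking the observed counts implied by the summary statistics (157 and 235 callbacks, 2278 and 2200 rejections) and the expected counts of 196 and 2239 per group, the chi-square statistic works out as:

```latex
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} \\
       &= \frac{(2278-2239)^2}{2239} + \frac{(2200-2239)^2}{2239}
        + \frac{(235-196)^2}{196} + \frac{(157-196)^2}{196} \\
       &= 2\cdot\frac{1521}{2239} + 2\cdot\frac{1521}{196} \approx 16.879
\end{aligned}
\]
```

Here O_i are the observed counts and E_i the expected counts, listed in the same order as observed_frequencies and expected_frequencies.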
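The p-value (about 0.00075) is far below 0.05, so we reject H0: the outcome differs by race, which suggests that racial discrimination does exist.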
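One caveat worth checking: stats.chisquare treats the four cells as a one-way table and by default uses 4 − 1 = 3 degrees of freedom, whereas a test of independence on a 2×2 table has 1 degree of freedom. A hedged cross-check, not part of the course, is scipy.stats.chi2_contingency, which derives the expected counts and degrees of freedom from the table itself. The counts are entered by hand from the figures implied by the summary statistics, and correction=False turns off the Yates continuity correction so that the statistic is comparable to the one above.

```python
import numpy as np
from scipy import stats

# Rows = called / not called, columns = blacks / whites (hand-entered counts)
observed = np.array([[157, 235],
                     [2278, 2200]])

# Independence test on the 2x2 table, without Yates' continuity correction
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(round(chi2, 3), dof)   # same statistic as above, with 1 degree of freedom
print(p_value < 0.05)        # H0 is still rejected at the 5% level
```

The statistic is unchanged, and with fewer degrees of freedom the p-value is smaller still, so the conclusion of the notes holds.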

