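This article shows how to use naive Bayes models for email (spam) classification. For the theory behind naive Bayes and its variants, see my previous article, 《【阿旭机器学习实战】【10】朴素贝叶斯模型原理及3种贝叶斯模型对比:高斯分布朴素贝叶斯、多项式分布朴素贝叶斯、伯努利分布朴素贝叶斯》 (naive Bayes principles and a comparison of the Gaussian, multinomial, and Bernoulli variants).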
Text classification in practice
Reading the text data
import pandas as pd
# sep specifies the field separator used in the file (a tab here)
sms = pd.read_csv("../data/SMSSpamCollection", sep="\t", header=None)
sms
5572 rows × 2 columns
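To make the effect of `sep` and `header=None` concrete, here is a minimal sketch that parses a few tab-separated lines from an in-memory string instead of the real file. The sample rows are made up for illustration and are not taken from SMSSpamCollection.

```python
import io
import pandas as pd

# Hypothetical sample in the same "label<TAB>text" layout; not real dataset rows.
raw = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

# header=None tells pandas there is no header row, so columns are named 0 and 1.
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # integer column labels
```

Without `header=None`, pandas would treat the first message as column names and silently drop it from the data.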
Extracting features and labels
# Column 1 holds the message text, column 0 the label (ham or spam)
data = sms[[1]]
target = sms[[0]]
data.shape
(5572, 1)
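The double brackets matter here: `sms[[1]]` keeps a two-dimensional DataFrame, while `sms[1]` returns a one-dimensional Series. A small sketch on a made-up frame (the rows are illustrative only) shows the difference:

```python
import pandas as pd

# Hypothetical stand-in for the sms frame: column 0 = label, column 1 = text.
toy = pd.DataFrame({0: ["ham", "spam"], 1: ["hello there", "free cash"]})

as_frame = toy[[1]]   # list of column labels -> DataFrame, shape (n, 1)
as_series = toy[1]    # single column label   -> Series, shape (n,)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

This is why the text has to be selected as `data[1]` before it goes to a vectorizer: the vectorizer expects an iterable of strings, not a one-column table. Passing a Series as the target (for example `sms[0]`) may also avoid the column-vector warning that scikit-learn can emit for a one-column DataFrame.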
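Converting the text to a sparse matrix
For text data, the words in each string are usually converted into floating-point weights and stored as a sparse matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer turns a collection of strings into a sparse TF-IDF matrix
tf = TfidfVectorizer()
# Fit: let the vectorizer learn which words occur in the messages
tf.fit(data[1])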
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)
# Transform: convert the 5572 messages into a 5572 x N sparse matrix
data = tf.transform(data[1])
data
# The result is a 5572 x 8713 sparse matrix, so the 5572 messages contain 8713 distinct tokens
<5572x8713 sparse matrix of type '<class 'numpy.float64'>' with 74169 stored elements in Compressed Sparse Row format>
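The same fit-then-transform idea can be checked by hand on a corpus small enough to count. The sketch below uses three made-up documents (an assumption for illustration, not dataset rows) and `fit_transform`, which combines the two steps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only.
docs = ["free cash prize", "call me later", "free call"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # learn the vocabulary and transform in one step

# One row per document, one column per distinct token.
print(X.shape)
# Tokens learned during fit; vocabulary_ maps each token to a column index.
print(sorted(vec.vocabulary_))
# Only non-zero weights are stored, which is what makes the matrix sparse.
print(X.nnz)
```

Here the six distinct tokens give six columns, and only eight of the eighteen cells hold a value. The full dataset behaves the same way at a larger scale: 74169 stored elements out of 5572 x 8713 cells.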
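Training the model
from sklearn.naive_bayes import BernoulliNB
# Bernoulli naive Bayes (assumed from the variable name b_NB)
b_NB = BernoulliNB()
b_NB.fit(data,target)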
message = [
    "Confidence doesn't need any specific reason. If you're alive , you should feel 100 percent confident.",
    "Avis is only NO.2 in rent a cars.SO why go with us?We try harder.",
    "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info",
]
Prediction
# Convert message into a sparse matrix using the already fitted vectorizer
x_test = tf.transform(message)
b_NB.predict(x_test)
array(['ham', 'ham', 'spam'], dtype='<U4')
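Because new messages must always pass through the same fitted vectorizer before prediction, the two steps can be bundled into one object. The following is a minimal sketch using scikit-learn's `make_pipeline`; the toy corpus and the name `spam_filter` are assumptions for illustration, not part of the original experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real data would be sms[1] and sms[0].
texts = ["win free cash now", "claim your free prize",
         "see you at lunch", "call me later today"]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw strings, then feeds the result to the classifier.
spam_filter = make_pipeline(TfidfVectorizer(), BernoulliNB())
spam_filter.fit(texts, labels)

new_messages = ["free cash prize", "see you at lunch"]
predictions = spam_filter.predict(new_messages)
print(predictions)

# Class probabilities; columns follow the order of spam_filter.classes_.
print(spam_filter.classes_)
print(spam_filter.predict_proba(new_messages).shape)
```

With this arrangement there is no separate `transform` call to forget, and `predict_proba` exposes how confident the model is rather than only the predicted label.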
b_NB.score(data,target)
0.98815506101938266
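Note that this score is computed on the same data the model was trained on, so it is likely to be optimistic. A minimal sketch of a held-out evaluation with `train_test_split` follows; the toy corpus, the 25% test size, and the random seed are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus; the real data would come from sms[1] and sms[0].
texts = [
    "win cash now", "free prize claim now", "urgent win free cash",
    "claim your free prize", "see you at lunch", "call me later today",
    "are you coming home", "thanks for the lunch",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out 25% of the messages; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training split only, so no test vocabulary leaks in.
vec = TfidfVectorizer()
clf = BernoulliNB()
clf.fit(vec.fit_transform(X_train), y_train)

# Accuracy on messages the model has not seen during training.
test_accuracy = clf.score(vec.transform(X_test), y_test)
print(test_accuracy)
```

On a corpus this small the number itself means little; the point is the order of operations, with the split done before the vectorizer is fitted.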
Using multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
m_NB = MultinomialNB()
m_NB.fit(data,target)
m_NB.score(data,target)
0.97613065326633164
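Using Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
g_NB = GaussianNB()
# GaussianNB requires dense input, so convert the sparse matrix with toarray()
g_NB.fit(data.toarray(),target)
g_NB.score(data.toarray(),target)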
0.94149318018664752
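The three experiments share the same fit and score pattern, so they can be run in one loop. The sketch below uses a tiny made-up corpus (an assumption for illustration), so its numbers will not match the scores reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy corpus, for illustration only.
texts = ["win free cash", "free prize now", "lunch at noon", "call me later"]
labels = ["spam", "spam", "ham", "ham"]

X = TfidfVectorizer().fit_transform(texts)

models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # GaussianNB does not accept sparse input, so densify for that model only.
    features = X.toarray() if name == "GaussianNB" else X
    model.fit(features, labels)
    scores[name] = model.score(features, labels)  # training accuracy

print(scores)
```

On the real data, the training-set scores reported in this article rank Bernoulli (about 0.988) above multinomial (about 0.976) and Gaussian (about 0.941).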