Resource overview
1. Apply the naive Bayes algorithm to classify the Content dataset
1) Clean the data
2) Segment the text into words, using the supplied dictionary and stop-word list
3) Build the NB model
Code snippet and file information
import pandas as pd
# Read in the review data
evaluation = pd.read_excel(r'Contents.xlsx')
# Show the first 10 rows
print(evaluation.head(10))
# Use a regular expression to strip digits and Latin letters from the reviews
# (regex=True is required on recent pandas, where plain-string matching is the default)
evaluation.Content = evaluation.Content.str.replace('[0-9a-zA-Z]', '', regex=True)
evaluation.head()
# Import the third-party package
import jieba
# Load the custom dictionary
jieba.load_userdict(r'all_words.txt')
# Read in the stop words
with open(r'mystopwords.txt', encoding='UTF-8') as words:
    stop_words = [i.strip() for i in words.readlines()]
# Define a custom segmentation function that drops stop words while segmenting
def cut_word(sentence):
    words = [i for i in jieba.lcut(sentence) if i not in stop_words]
    # Join the segmented words with spaces
    result = ' '.join(words)
    return result
# Segment every review in the Content column
words = evaluation.Content.apply(cut_word)
# Segmentation result for the first 5 rows
words[:5]
# Import the third-party package
from sklearn.feature_extraction.text import CountVectorizer
# Compute each word's ... in each review (snippet truncated in the source)
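The snippet breaks off right after importing CountVectorizer, so step 3 (building the NB model) is not visible. The following is only a minimal sketch of how that step might look, not the code from nbtest.py. It assumes a multinomial naive Bayes model (`MultinomialNB`); the toy documents, the label values, and the `token_pattern` setting are all hypothetical and chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: space-separated tokens, in the same shape as
# the strings returned by cut_word. Not taken from Contents.xlsx.
docs = [
    "质量 很好 满意",
    "物流 快 满意",
    "质量 差 失望",
    "物流 慢 失望",
]
# Assumed label column; the real dataset's label name and values are unknown.
labels = ["positive", "positive", "negative", "negative"]

# Split on whitespace only, so single-character tokens are kept
# (the default token_pattern drops tokens shorter than two characters).
vectorizer = CountVectorizer(token_pattern=r"(?u)\S+")
X = vectorizer.fit_transform(docs)  # document-term count matrix

# Fit a multinomial naive Bayes classifier on the word counts.
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, already segmented review.
new_doc = ["质量 很好"]
prediction = model.predict(vectorizer.transform(new_doc))
print(prediction[0])
```

With the real data, `docs` would be replaced by the `words` series produced above, and the data would normally be split into training and test sets before fitting so that accuracy can be measured on unseen reviews.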
 Attribute       Size        Date   Time  Name
----------- ---------  ---------- -----  ----
       File    701641  2019-08-25 09:01  all_words.txt
       File    929935  2019-08-25 09:01  Contents.xlsx
       File     99725  2019-08-25 09:01  mystopwords.txt
       File      3050  2019-09-14 21:07  nbtest.py
----------- ---------  ---------- -----  ----
              1734351                    4