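Resource description
Computes Chinese text similarity with the TF-IDF method of the gensim package. The code runs as-is and ships with a Chinese stopword list.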
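
To make the idea behind the description concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It is an illustration under stated assumptions, not the resource's code: the weighting `tf * log2(N / df)` with unit-length normalisation is assumed (it is believed to match gensim's defaults), the helper names `tfidf_vectors`, `weigh` and `cosine` are hypothetical, and the token lists are made-up stand-ins for segmenter output.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Weight raw term counts by IDF, then scale the vector to unit length."""
    vec = {}
    for tok, tf in Counter(tokens).items():
        weight = tf * idf.get(tok, 0.0)  # tokens unseen in the corpus get 0
        if weight > 0.0:
            vec[tok] = weight
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def tfidf_vectors(token_lists):
    """Return (vectors, idf): one unit-length TF-IDF dict per document."""
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    # Assumed IDF: log2(N / df); a token present in every document gets 0.
    idf = {tok: math.log2(n_docs / count) for tok, count in df.items()}
    return [weigh(tokens, idf) for tokens in token_lists], idf


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


# Hypothetical, already segmented documents (stand-ins for jieba output).
docs = [["內存", "存儲器", "CPU", "數據"],
        ["硬盤", "存儲", "碟片"],
        ["顯示器", "設備", "屏幕"]]
query = ["內存", "CPU", "存儲"]

vectors, idf = tfidf_vectors(docs)
query_vec = weigh(query, idf)
scores = [cosine(query_vec, vec) for vec in vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The query shares two tokens with the first document and one with the second, so the first document ranks highest. gensim performs the same kind of computation on sparse matrices, which matters once the corpus grows beyond a handful of texts.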

Code snippet and file information
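# -*- coding: utf-8 -*-
# @Time    : 2018/7/30 15:52
# @Author  : yip
# @Email   : 522364642@qq.com
# @File    : ChineseSimilarityCalculation.py
"""
Chinese sentence similarity calculation based on the gensim module.
The approach is as follows:
1. Text preprocessing: Chinese word segmentation, stopword removal
2. Compute word frequencies
3. Build the dictionary (mapping between words and ids)
4. Convert the document to compare into a vector (bag-of-words representation)
5. Build the corpus
6. Initialize the model
7. Create the index
8. Compute similarities and return the most similar text
"""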
from gensim import corpora, models, similarities
import logging
from collections import defaultdict
import jieba
# Set up logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Prepare the data: 8 text items, stored in a list
documents = ["1)鍵盤是用于操作設備運行的一種指令和數據輸入裝置,也指經過系統安排操作一臺機器或設備的一組功能鍵(如打字機、電腦鍵盤)",
             "2)鼠標稱呼應該是“鼠標器”,英文名“Mouse”,鼠標的使用是為了使計算機的操作更加簡便快捷,來代替鍵盤那繁瑣的指令。",
             "3)中央處理器(CPU,Central Processing Unit)是一塊超大規模的集成電路,是一臺計算機的運算核心(Core)和控制核心( Control Unit)。",
             "4)硬盤是電腦主要的存儲媒介之一,由一個或者多個鋁制或者玻璃制的碟片組成。碟片外覆蓋有鐵磁性材料。",
             "5)內存(Memory)也被稱為內存儲器,其作用是用于暫時存放CPU中的運算數據,以及與硬盤等外部存儲器交換的數據。",
             "6)顯示器(display)通常也被稱為監視器。顯示器是屬于電腦的I/O設備,即輸入輸出設備。它是一種將一定的電子文件通過特定的傳輸設備顯示到屏幕上再反射到人眼的顯示工具。",
             "7)顯卡(Video card,Graphics card)全稱顯示接口卡,又稱顯示適配器,是計算機最基本配置、最重要的配件之一。",
             "8)cache高速緩沖存儲器一種特殊的存儲器子系統,其中復制了頻繁使用的數據以利于快速訪問。"]
# The document to compare against the list
new_doc = "內存又稱主存,是CPU能直接尋址的存儲空間,由半導體器件制成。"
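# 1. Text preprocessing: Chinese word segmentation, stopword removal
print('1. Text preprocessing: Chinese word segmentation, stopword removal')
# Load the stopwords (one per line); the with-block closes the file automatically
stopwords = set()
with open("stopwords.txt", 'r', encoding='UTF-8') as file:
    for line in file:
        stopwords.add(line.strip())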
# Store the segmented, stopword-filtered documents in the list `texts`
texts = []
for line in documents:
    words = ' '.join(jieba.cut(line)).split(' ')    # segment the Chinese text with jieba
    text = []
    # Filter stopwords: keep only words that are not in the stopword set
    for word in words:
        if word not in stopwords:
            text.append(word)
    texts.append(text)
for line in texts:
    print(line)
# Preprocess the document to compare in the same way
words = ' '.join(jieba.cut(new_doc)).split(' ')
new_text = []
for word in words:
    if word not in stopwords:
        new_text.append(word)
print(new_text)
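# 2. Compute word frequencies
print('2. Compute word frequencies')
frequency = defaultdict(int)  # missing keys default to a count of 0
# Walk the segmented documents and count how often each word occurs
for text in texts:
    for word in text:
        frequency[word] += 1
# Keep only words that occur more than once (adjust the threshold as needed)
texts = [[word for word in text if frequency[word] > 1] for text in texts]
for line in texts:
    print(line)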
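# 3. Build the dictionary (mapping between words and ids)
print('3. Build the dictionary (mapping between words and ids)')
dictionary = corpora.Dictionary(texts)
print(dictionary)
# Print the mapping: the key is the word, the value is the word's id
print(dictionary.token2id)
# 4. Convert the document to compare into a vector (bag-of-words representation)
print('4. Convert the document to compare into a vector (bag-of-words representation)')
# doc2bow counts the occurrences of each distinct word, maps each word to its id,
# and returns the result as a sparse vector
new_vec = dictionary.doc2bow(new_text)
print(new_vec)
# 5. Build the corpus
print('5. Build the corpus')
# Convert every document into a vector
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)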
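# 6. Initialize the model
print('6. Initialize the model')
# Initialize a tfidf model; it converts vectors from one representation
# (bag-of-words integer counts) to another (Tfidf real-valued weights)
tfidf = models.TfidfModel(corpus)
# Convert the whole corpus to the tfidf representation
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
# 7. Create the index [snippet truncated in the source]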
 Attribute       Size  Date       Time   Name
----------- ---------  ---------- -----  ----
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\
 File              79  2018-08-01 15:53  ChineseSimilarity-gensim-tfidf\.git\COMMIT_EDITMSG
 File             329  2018-08-01 15:54  ChineseSimilarity-gensim-tfidf\.git\config
 File              48  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\desc
 File             124  2018-12-14 16:54  ChineseSimilarity-gensim-tfidf\.git\FETCH_HEAD
 File              23  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\HEAD
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\hooks\
 File             478  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\applypatch-msg.sample
 File             896  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\commit-msg.sample
 File            3327  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\fsmonitor-watchman.sample
 File             189  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\post-update.sample
 File             424  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\pre-applypatch.sample
 File            1642  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\pre-commit.sample
 File            1348  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\pre-push.sample
 File            4898  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\pre-reba
 File             544  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\pre-receive.sample
 File            1492  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\prepare-commit-msg.sample
 File            3610  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\hooks\update.sample
 File             393  2018-08-01 15:53  ChineseSimilarity-gensim-tfidf\.git\index
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\info\
 File             240  2018-08-01 15:49  ChineseSimilarity-gensim-tfidf\.git\info\exclude
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\logs\
 File             374  2018-08-01 15:53  ChineseSimilarity-gensim-tfidf\.git\logs\HEAD
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\logs\refs\
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\logs\refs\heads\
 File             374  2018-08-01 15:53  ChineseSimilarity-gensim-tfidf\.git\logs\refs\heads\master
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\logs\refs\remotes\
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\logs\refs\remotes\origin\
 File             959  2018-08-01 16:18  ChineseSimilarity-gensim-tfidf\.git\logs\refs\remotes\origin\master
 Directory          0  2019-01-18 14:53  ChineseSimilarity-gensim-tfidf\.git\ob
............ 52 more file entries omitted here