Resource overview
Python: text preprocessing, word-frequency counting, and TF-IDF computation
Code snippet and file information
# coding=utf-8                   # this declaration must be the first line; it allows Chinese comments in the source
from __future__ import division  # make "/" perform true division
import os
import math
import nltk
import codecs  # lets files be opened with a specified encoding
from nltk.corpus import stopwords
import sys
# Python 2 only: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')
doc_num = 0  # total number of documents
dictionary = []  # vocabulary: the list of words found in the files
word_idf_dict = {}  # maps each word in the vocabulary to its idf value
num_doc_word = {}  # maps each paper to its total word count
def processing(text):
    """Preprocess the given text and return a list of the resulting word stems."""
    text_lower = text.lower()  # lowercase the text
    sens = nltk.sent_tokenize(text_lower)  # split the lowercased text into sentences
    words = []
    # tokenize each sentence into words
    for sen in sens:
        words.extend(nltk.word_tokenize(sen))  # extend (not append) adds the tokens one by one
    # build the filters once, outside the loop, so new_words exists even for empty input
    stopword = stopwords.words('english')  # stop words to remove
    punctuation = [',', '.', ':', ';', '(', ')', '[', ']', '&', '#', '!', '?', '@', '$', '%']  # punctuation to remove
    stemming = nltk.stem.SnowballStemmer('english')  # stemmer
    new_words = []
    for word in words:
        # drop garbage tokens, i.e. anything that is not purely alphabetic
        if word.isalpha() and (word not in stopword) and (word not in punctuation):
            new_words.append(stemming.stem(word))  # append adds the single stem
    return new_words
def compute_tf(wordlist):
    """Count word frequencies in the given word list; return a dict mapping each word to its count."""
    temp_dict = {}
    for word in wordlist:
        if word in temp_dict:
            temp_dict[word] += 1
        else:
            temp_dict[word] = 1
    return temp_dict
def compute_idf(word_in_file):
    """Compute each word's inverse document frequency. The argument has type dict{file_name: dict{word: word_tf}}."""
    for word in dictionary:
        word_in_doc = 0  # number of documents in which the word appears
        for index in word_in_file:
            if word in word_in_file[index]:
                word_in_doc += 1
        # idf = log10(total documents / documents containing the word), stored in the global word_idf_dict
        word_idf_dict[word] = math.log10(doc_num / word_in_doc)
        # word_idf_dict is not returned; it is a global variable so that other functions can use it
def compute_tfidf(word_in_file):
    """Compute each word's tf-idf value. The argument has type dict{file_name: dict{word: word_tf}}."""
    word_tfidf = {}  # tf-idf values, of type dict{file_name: dict{word: word_tfidf}}
    for index in word_in_file:
        word_tfidf[index] = {}  # dict{word: word_tfidf} for this document
        temp_len = num_doc_word[index]  # total word count of this document, from the global num_doc_word
        for word in word_in_file[index].keys():
            word_tfidf[index][word]
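
For a version that runs end to end, here is a minimal Python 3 sketch of the same pipeline using only the standard library. It is not the original author's code. The regex tokenizer and the tiny `STOPWORDS` set are hypothetical stand-ins for the NLTK tokenizers, stop-word list, and Snowball stemmer. The final weighting (word count divided by the document's word total, times log10(N / df)) is an assumption, because the listing above breaks off before the tf-idf assignment is completed. The functions return their results instead of writing to module-level globals.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the listing above uses NLTK's English stop-word list instead.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def compute_tf(words):
    """Return {word: raw count} for one document."""
    return dict(Counter(words))


def compute_idf(tf_by_doc):
    """Return {word: log10(N / df)}, where df is the number of documents containing the word."""
    doc_num = len(tf_by_doc)
    doc_freq = Counter()
    for tf in tf_by_doc.values():
        doc_freq.update(tf.keys())  # each document counts once per distinct word
    return {word: math.log10(doc_num / df) for word, df in doc_freq.items()}


def compute_tfidf(tf_by_doc, idf):
    """Assumed weighting: (count / total words in the document) * idf."""
    tfidf = {}
    for name, tf in tf_by_doc.items():
        total = sum(tf.values())
        tfidf[name] = {word: (count / total) * idf[word] for word, count in tf.items()}
    return tfidf


# Made-up sample documents, keyed by file name
docs = {
    "a.txt": "The cat sat on the mat",
    "b.txt": "The dog sat on the log",
}
tf_by_doc = {name: compute_tf(tokenize(text)) for name, text in docs.items()}
idf = compute_idf(tf_by_doc)
tfidf = compute_tfidf(tf_by_doc, idf)
```

With these two sample documents, `sat` occurs in both and therefore gets an idf (and tf-idf) of 0, while words unique to one document receive a positive weight.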