
  • Size: 13.63 MB
    File type: .zip
    Coins: 1
    Downloads: 0
    Published: 2023-07-22
  • Language: Other
  • Tags:

Resource description

NLP for beginners: a comprehensive Chinese text classification system (train set & test set + 4 stop-word lists + word2vec + TF-IDF + Naive Bayes)
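
To make the pipeline named above more concrete, here is a minimal, standard-library-only sketch of stop-word filtering, TF-IDF weighting, and multinomial Naive Bayes. It is not the code shipped in the archive, whose script relies on jieba and scikit-learn. The names `tokenize`, `fit_idf`, `to_tfidf`, `TinyMultinomialNB`, and `classify`, as well as the toy corpus and stop-word set, are hypothetical and exist only for illustration. The sketch assumes the text has already been segmented into space-separated words, and it skips the vector normalization that a full TF-IDF implementation would usually apply.

```python
import math
from collections import Counter, defaultdict


def tokenize(segmented_text, stop_words):
    """Split space-separated (already segmented) text and drop stop words."""
    return [w for w in segmented_text.split() if w not in stop_words]


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def to_tfidf(doc, idf):
    """Weight raw term counts by idf; terms unseen in training are dropped."""
    return {term: tf * idf[term] for term, tf in Counter(doc).items() if term in idf}


class TinyMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, vectors, labels):
        vocab = sorted({term for vec in vectors for term in vec})
        self.log_prior = {}
        self.log_likelihood = {}
        for cls in sorted(set(labels)):
            rows = [vec for vec, y in zip(vectors, labels) if y == cls]
            self.log_prior[cls] = math.log(len(rows) / len(labels))
            # Sum the TF-IDF weight of every term over the documents of this class.
            totals = defaultdict(float)
            for vec in rows:
                for term, weight in vec.items():
                    totals[term] += weight
            denom = sum(totals.values()) + self.alpha * len(vocab)
            self.log_likelihood[cls] = {
                term: math.log((totals[term] + self.alpha) / denom) for term in vocab
            }
        return self

    def predict(self, vector):
        """Return the class with the highest log posterior score."""
        scores = {
            cls: self.log_prior[cls]
            + sum(w * self.log_likelihood[cls][t] for t, w in vector.items()
                  if t in self.log_likelihood[cls])
            for cls in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy, pre-segmented corpus (illustrative only, not taken from the archive).
stop_words = {"的", "了"}
train = [
    ("sports", "球队 赢得 比赛"),
    ("sports", "球员 进球 比赛"),
    ("finance", "股票 市场 上涨"),
    ("finance", "银行 利率 市场"),
]
train_docs = [tokenize(text, stop_words) for _, text in train]
train_labels = [label for label, _ in train]
idf = fit_idf(train_docs)
clf = TinyMultinomialNB().fit([to_tfidf(d, idf) for d in train_docs], train_labels)


def classify(text):
    """Run the full toy pipeline on one segmented string."""
    return clf.predict(to_tfidf(tokenize(text, stop_words), idf))


print(classify("比赛 的 球员"))
print(classify("市场 利率"))
```

In practice the scikit-learn classes imported by the script (`TfidfVectorizer`, `MultinomialNB`) would replace these hand-written pieces; the sketch only shows what those components compute in principle.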

Code snippet and file information

# -*- coding: utf-8 -*-
import os
import jieba
import pickle

from sklearn import metrics
from sklearn.utils import Bunch  # sklearn.datasets.base is gone from recent scikit-learn releases
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from numpy import *
import io


class NLP_C:
    def __init__(self):
        self.corpus_path_train = "data/train/"
        self.seg_path_train = "data/train_seg/"
        self.wordbag_path_train = "data/train_word_bag/train_set.dat"
        self.corpus_path_test = "data/test/"
        self.seg_path_test = "data/test_seg/"
        self.wordbag_path_test = "data/test_word_bag/test_set.dat"
        self.stopword_path = "data/SiChuanDaXue.txt"
        # self.stopword_path = "data/train_word_bag/hlt_stop_words.txt"
        self.space_path_train = "data/train_word_bag/tfidfspace.dat"
        self.space_path_test = "data/test_word_bag/testspace.dat"

    def savefile(self, savepath, content):
        # Write text as GB2312, silently dropping characters that cannot be encoded
        with io.open(savepath, 'w', encoding='gb2312', errors='ignore') as fp:
            fp.write(content)

    def readfile(self, path):
        # Read text as GB2312, silently dropping bytes that cannot be decoded
        with io.open(path, 'r', encoding='gb2312', errors='ignore') as fp:
            return fp.read()

    def splitwords(self, corpus_path, seg_path):
        catelist = os.listdir(corpus_path)  # all category subdirectories under corpus_path

        # Process every file in each category directory
        for mydir in catelist:
            class_path = corpus_path + mydir + "/"  # path of the category subdirectory
            seg_dir = seg_path + mydir + "/"  # output directory for the segmented category
            if not os.path.exists(seg_dir):  # create the output directory if it is missing
                os.makedirs(seg_dir)
            file_list = os.listdir(class_path)  # all files under class_path
            for file_path in file_list:  # iterate over the files of this category
                fullname = class_path + file_path  # full path of the file
                content = self.readfile(fullname).strip()  # read the file content
                content = content.replace("\r\n", "")  # remove CRLF line breaks
                content_seg = jieba.cut(content.strip())  # segment the content into words
                self.savefile(seg_dir + file_path, " ".join(content_seg))  # save the segmented text to the segmented-corpus directory

    # Bunch provides a key-value style object
    # target_name: list of all category names
    # label: category label of each file
    # filenames: file paths
    # contents: each file's segmented text in word-vector form
    def word2vec(self, seg_path, wordbag_path):
        bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

        catelist = os.listdir(seg_path)  # all category subdirectories under seg_path
        bunch.target_name.extend(catelist)
        # Process every file in each category directory
        for mydir in catelist:
            class_path = seg_path + mydir + "/"  # path of the category subdirectory
            file_list = os.listdir(class_path)  # all files under class_path
            for file_path in file_list:  # iterate over the files of this category
                fullname = class_path + file_path  # full path of the file
                bunch.label.append(mydir)
                bunch.filenames.append(fullname)
                # ... (the excerpt is cut off here)
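
The excerpt ends before showing how the `Bunch` reaches `wordbag_path`. Since the script imports `pickle`, one plausible approach is to serialize the object to that path. The sketch below is an assumption rather than the archive's actual code: `save_bunch` and `load_bunch` are hypothetical helper names, a plain dict stands in for scikit-learn's `Bunch` so the example runs with the standard library alone, and the sample values are made up.

```python
import os
import pickle
import tempfile


def save_bunch(path, bunch):
    """Serialize the corpus container to disk with pickle."""
    with open(path, "wb") as fp:
        pickle.dump(bunch, fp)


def load_bunch(path):
    """Load a container previously written by save_bunch."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


# A plain dict stands in for sklearn's Bunch; all values are illustrative.
bunch = {
    "target_name": ["人才"],
    "label": ["人才"],
    "filenames": ["example_seg/人才/1.txt"],
    "contents": ["已 分词 的 文本"],
}

# Round-trip through a temporary file instead of a real word-bag path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_set.dat")
    save_bunch(path, bunch)
    restored = load_bunch(path)

print(restored == bunch)
```

Pickle files can execute arbitrary code when loaded, so a `.dat` file of this kind should only be loaded if it comes from a trusted source.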

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
File             8738  2018-04-08 21:02  NLP_Test.py
Directory           0  2018-04-08 20:56  data\
File             3398  2018-02-12 07:53  data\ChineseStop.txt
File             3823  2018-02-12 07:53  data\HaGongDa.txt
File             5389  2018-02-12 07:53  data\SiChuanDaXue.txt
Directory           0  2018-04-08 16:18  data\TanCorpMinTest\
Directory           0  2018-04-08 16:18  data\TanCorpMinTest\人才\
File             3320  2018-02-12 07:53  data\TanCorpMinTest\人才\103.txt
File             4155  2018-02-12 07:53  data\TanCorpMinTest\人才\116.txt
File             1712  2018-02-12 07:53  data\TanCorpMinTest\人才\129.txt
File             1619  2018-02-12 07:53  data\TanCorpMinTest\人才\141.txt
File             1583  2018-02-12 07:53  data\TanCorpMinTest\人才\154.txt
File             8161  2018-02-12 07:53  data\TanCorpMinTest\人才\167.txt
File             1394  2018-02-12 07:53  data\TanCorpMinTest\人才\18.txt
File             1610  2018-02-12 07:53  data\TanCorpMinTest\人才\192.txt
File             3382  2018-02-12 07:53  data\TanCorpMinTest\人才\204.txt
File             3071  2018-02-12 07:53  data\TanCorpMinTest\人才\217.txt
File             2260  2018-02-12 07:53  data\TanCorpMinTest\人才\23.txt
File             2024  2018-02-12 07:53  data\TanCorpMinTest\人才\242.txt
File             4243  2018-02-12 07:53  data\TanCorpMinTest\人才\255.txt
File             2240  2018-02-12 07:53  data\TanCorpMinTest\人才\268.txt
File             4484  2018-02-12 07:53  data\TanCorpMinTest\人才\280.txt
File             3504  2018-02-12 07:53  data\TanCorpMinTest\人才\293.txt
File             5240  2018-02-12 07:53  data\TanCorpMinTest\人才\305.txt
File             1071  2018-02-12 07:53  data\TanCorpMinTest\人才\318.txt
File             4464  2018-02-12 07:53  data\TanCorpMinTest\人才\330.txt
File            16049  2018-02-12 07:53  data\TanCorpMinTest\人才\343.txt
File             4722  2018-02-12 07:53  data\TanCorpMinTest\人才\356.txt
File             6241  2018-02-12 07:53  data\TanCorpMinTest\人才\369.txt
File             3281  2018-02-12 07:53  data\TanCorpMinTest\人才\381.txt
File             2814  2018-02-12 07:53  data\TanCorpMinTest\人才\394.txt
............ 5751 more file entries omitted
