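Resource description
A Python crawler that scrapes post titles from a campus forum. It crawls more than a thousand pages of one board and turns the keywords in those titles into a custom word cloud.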
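
The keyword step described above is not part of the snippet shown on this page, so the following is only a minimal sketch of how title keywords might be counted before drawing a word cloud. The names `simple_tokenize`, `count_keywords` and `STOPWORDS` are hypothetical, and the regex tokenizer is a stand-in: for Chinese titles it returns whole runs of characters, so a real segmenter (for example `jieba.lcut`) would be passed as `tokenize` instead.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to"}


def simple_tokenize(title):
    """Split a title into lowercase word tokens (placeholder for a real segmenter)."""
    return [token.lower() for token in re.findall(r"\w+", title)]


def count_keywords(titles, tokenize=simple_tokenize, min_len=2):
    """Return a Counter mapping each keyword to its frequency over all titles."""
    counts = Counter()
    for title in titles:
        for token in tokenize(title):
            # Skip very short tokens and stop words.
            if len(token) >= min_len and token not in STOPWORDS:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = ["Selling a used bike", "Used textbooks for sale", "Bike repair shop?"]
    for word, n in count_keywords(sample).most_common(3):
        print(word, n)
```

The resulting frequency mapping is the kind of input a word-cloud library can consume, for instance through `WordCloud.generate_from_frequencies` in the `wordcloud` package, assuming that package and a font covering Chinese characters are installed.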

Code snippet and file information
# encoding=utf-8
# Extracting a Chinese word cloud
# Simulated login for the crawler
import requests
import re
import os
import codecs
from bs4 import BeautifulSoup
# requests: log in automatically through cookies; the login data was captured with Fiddler
cookie = '''BoardList=BoardID=Show; ASPSESSIONIDASRBTSBT=OEMCBLKCNHKEIMBINCFEBEND;\
upNum=0;aspsky=username=next%E9%A3%9E%E7%BF%94%E5%95%8A&usercookies=3&userid=461137\
&useranony=&userhidden=2&password=aecc6c796669a52f'''
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cookie': cookie}
def get_title(url):
    s = requests.session()
    h = s.get(url, headers=header)  # the Cookie header logs in directly
    soup = BeautifulSoup(h.text, 'html.parser')
    qurl = soup.select('.tablebody1')
    for i in range(8, len(qurl)):
        res = qurl[i].text.strip()  # res is a string
        if len(res) > 9:  # skip the activity cells
            res1 = res.split()  # drop the reply count after the title
            print(res1[0])
            # Append the title to result.txt; the with block closes the file
            with codecs.open('result.txt', 'a', 'utf8') as f:
                f.write(res1[0] + '\r\n')
for i in range(1, 1000):
    url = 'http://www.cc98.org/list.asp?boardid=182&page=' + str(i)
    get_title(url)  # each iteration processes one page of the board list
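
The script above sends the captured cookie as one raw `Cookie` header string. As an alternative sketch, the same string could be split into a dictionary, which `requests` also accepts through its `cookies=` parameter. The helper name `parse_cookie_header` is hypothetical and the sample value in the usage line is a shortened placeholder, not a working session.

```python
def parse_cookie_header(raw):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split only at the first '=', because values may contain '=' themselves.
        name, sep, value = part.partition("=")
        if sep:
            cookies[name.strip()] = value.strip()
    return cookies


if __name__ == "__main__":
    print(parse_cookie_header("BoardList=BoardID=Show; upNum=0"))
```

With such a dictionary, a call of the form `requests.get(url, cookies=parse_cookie_header(cookie))` would let `requests` build the header itself. Whether the forum accepts the result still depends on the session being valid.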
 Attribute        Size  Date        Time   Name
 -----------  --------  ----------  -----  ----
 Directory           0  2017-04-16  16:15  python爬蟲-詞云實戰\
 File             1388  2017-04-16  16:16  python爬蟲-詞云實戰\task6-1.py
 File             1091  2017-04-16  16:16  python爬蟲-詞云實戰\task6-2.py
 File             1137  2017-04-16  16:16  python爬蟲-詞云實戰\task6-3.py