Resource Overview

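Content: Given a list of existing hotel home page URLs on Dianping, the program automatically scrapes each hotel's name, images, latitude and longitude, price, and number of user reviews, together with each review's user ID, username, rating, and review time, and saves the successfully scraped content to .txt files.
Platform: Python 3.5.3; Eclipse with PyDev
Main program: DianpingSpider.py
Note: The program sets delays and emulates a browser, which is fairly effective at keeping Dianping's anti-crawler mechanism from detecting frequent access from a single IP and blocking the crawl. IP proxying is not implemented.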
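The note above names two anti-blocking measures: pauses between requests and browser emulation. The following is a minimal sketch of how the two could be combined with `urllib.request`. The helper names `make_opener`, `polite_pause`, and `fetch` and the delay bounds are illustrative assumptions and are not taken from DianpingSpider.py; the User-Agent string is the one that appears in the code snippet on this page.

```python
import random
import time
import urllib.request

# User-Agent string as it appears in the snippet on this page.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) "
              "like Gecko")


def make_opener(user_agent=USER_AGENT):
    """Return an opener that sends a browser-like User-Agent header."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def polite_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random interval so requests are not evenly spaced.

    The bounds are assumed values; the source does not state its delays.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def fetch(url, opener=None):
    """Pause, then download `url` and decode it, ignoring invalid bytes."""
    opener = opener or make_opener()
    polite_pause()
    return opener.open(url).read().decode("utf-8", "ignore")
```

Passing `sleep` as a parameter keeps the pause logic testable without actually waiting. Randomized delays only reduce the request rate seen from one IP; they do not replace the IP proxying that the note says is missing.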

Code Snippet and File Information

'''
Program:  DianPingSpider
Purpose:  Given existing hotel home page URLs on Dianping, automatically scrape each hotel's name, images, latitude and longitude, price, and number of user reviews, plus each review's user ID, username, rating, and review time
Language: python3.5.2
Created on 2016-10-30
@author: Lei Gaiceong
'''
import urllib.request
import re
import time
import picture  # fetches hotel images
import urlspider  # fetches hotel URLs
import position
import os
import random
import PriceAndScores

# Collect the review information of every hotel and write it to text files
def getRatingAll(fileIn):
    count = 0          # counter, used to show progress
    websitenumber = 0
    for line in open(fileIn, 'r'):     # read the file line by line; each line holds one hotel URL
        count = line.split('\t')[0]
        line = line.split('\t')[1]
        websitenumber += 1
        print("Fetching hotel information from URL no. %s" % (websitenumber))
        try:
            print("Fetching information for hotel no. %s, URL: %s" % (count, line.strip('\n')))
            # Get the hotel ID
            hotelid = line.strip('\n').split('/')[4]
            #print('hotelid of this hotel: ', hotelid)

            # Build the URL of the hotel's first review page
            url = line.strip('\n') + "/review_more"
            #print('URL of the first review page: ', url)

            # Open the URL while emulating a browser
            headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko')
            opener = urllib.request.build_opener()
            opener.addheaders = [headers]
            data = opener.open(url).read()
            #print(data)  # If Dianping is accessed too frequently, its anti-crawler blocks the local IP; data is then abnormal and no valid HTML is printed
            data = data.decode('utf-8', 'ignore')
            #print(data)

            # Get the number of user reviews ("全部点评" is the page's "All reviews" label)
            rate_number = re.compile(r'全部点评\((.*?)\)', re.DOTALL).findall(data)
            rate_number = int(''.join(rate_number))  # join the list of matches into one str, then convert it to int
            print("Hotel no. %d has %s reviews" % (websitenumber, rate_number))

            if rate_number < 100:  # skip hotels with fewer than 100 reviews and scrape nothing further
                continue

            # Get the hotel name and address
            opener1 = urllib.request.build_opener()
            opener1.addheaders = [headers]
            hotel_url = opener1.open(line.strip('\n')).read()
            hotel_data = hotel_url.decode('utf-8', 'ignore')  # 'ignore' drops invalid byte sequences and keeps only valid text
            #print(hotel_data)  # If Dianping is accessed too frequently, its anti-crawler blocks the local IP; hotel_data is then abnormal and no valid HTML is printed
            # Get the hotel name (the HTML tags of this pattern appear to be missing from this listing)
            shop_name = re.compile(u'(.*?)    ', re.DOTALL).findall(hotel_data)
            shop_name = str(''.join(shop_name))  # type conversion
            shop_name = shop_name.strip()  # remove surrounding whitespace and newlines
            # Get the hotel address ("地址" means "address"; the HTML tags of this pattern also appear to be missing)
            address = re.compile(u'地址: (.*?)\n         ', re.DOTALL).findall(hotel_data)
            address = str(''.join(address))
            #print(address)

            # Get the hotel's latitude and longitude
            poi = re.compile(r'poi: \"(.*?)\"', re.DOTALL).findall(hotel_data)
            poi = str(''.join(poi))
            (longitude, latitude) = position.getPosition(poi)
            print("longitude:%s°E, latitude:%s°N" % (longitude, lati
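
The parsing steps in the snippet (splitting an input line into its index and URL, taking the hotel ID from the URL path, and reading the review count) can be tried offline on sample strings. In the sketch below, the sample URL, the ID 1234567, the HTML fragments, and the helper names are made up for illustration; only the splitting logic and the "全部点评" pattern follow the snippet.

```python
import re

# "全部点评" is the "All reviews" label that precedes the count.
REVIEW_COUNT_RE = re.compile(r"全部点评\((.*?)\)", re.DOTALL)


def parse_input_line(line):
    """Split 'index<TAB>url' into (index, url) with the newline removed."""
    index, url = line.rstrip("\n").split("\t")[:2]
    return index, url


def hotel_id_from_url(url):
    """Return the fifth '/'-separated field of the URL."""
    return url.split("/")[4]


def review_count(html):
    """Return the review count as an int, or None if it cannot be read."""
    match = REVIEW_COUNT_RE.search(html)
    if match is None or not match.group(1).strip().isdigit():
        return None
    return int(match.group(1))


# Hypothetical input line in the "index<TAB>url" layout.
sample_line = "1\thttp://www.dianping.com/shop/1234567\n"
index, url = parse_input_line(sample_line)
print(index, hotel_id_from_url(url))                # 1 1234567
print(review_count("<span>全部点评(256)</span>"))    # 256
print(review_count("<html>blocked</html>"))         # None
```

Returning `None` when the label is absent lets the caller skip a page explicitly, for example when the site has served a block page, instead of relying on the `ValueError` that `int('')` raises.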

 Attribute        Size  Date        Time   Name
 ----------- ---------  ----------  -----  ----
 Directory           0  2016-10-30  21:36  DianPingSpider\
 File             9368  2016-10-30  21:45  DianPingSpider\DianpingSpider.py
 Directory           0  2016-10-30  21:34  DianPingSpider\hotel\
 Directory           0  2016-10-30  21:34  DianPingSpider\image\
 File             1038  2016-10-10  14:36  DianPingSpider\picture.py
 File             1426  2016-10-30  20:50  DianPingSpider\position.py
 File             1080  2016-10-30  00:31  DianPingSpider\PriceAndScores.py
 File             1757  2016-10-30  21:00  DianPingSpider\urlspider.py
 Directory           0  2016-10-30  21:05  DianPingSpider\__pycache__\
 File              977  2016-10-10  14:36  DianPingSpider\__pycache__\picture.cpython-35.pyc
 File             1547  2016-10-30  20:54  DianPingSpider\__pycache__\position.cpython-35.pyc
 File             1081  2016-10-30  00:56  DianPingSpider\__pycache__\PriceAndScores.cpython-35.pyc
 File              388  2016-10-23  15:50  DianPingSpider\__pycache__\test.cpython-35.pyc
 File             1539  2016-10-22  14:59  DianPingSpider\__pycache__\urlspider.cpython-35.pyc
 File             5876  2016-10-23  15:50  DianPingSpider\__pycache__\__init__.cpython-35.pyc
