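Resource overview
A Python script that scrapes all Nike shoe listings on Taobao and analyzes the data; the results include Excel files, bar charts, pie charts, and scatter plots.

Code snippet and file information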
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 20 20:56:27 2018
@author: wangf
"""
import re
import time
import requests
import pandas as pd
from retrying import retry
from concurrent.futures import ThreadPoolExecutor
start = time.perf_counter()     # timer: start (time.clock() was removed in Python 3.8)
# plist holds the URL offset num for pages 1-100
plist = []
for i in range(1, 101):
    j = 44*(i-1)
    plist.append(j)
listno = plist
datatmsp = pd.DataFrame(columns=[])
while True:
    @retry(stop_max_attempt_number=8)     # maximum number of retry attempts
    def network_programming(num):
        # Adjacent string literals are joined, so no stray whitespace ends up in the URL
        url = ('https://s.taobao.com/search?q=%E8%80%90%E5%85%8B&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20170710'
               '&ie=utf8&sort=sale-desc&style=list&fs=1&filter_tianmao'
               '=tmall&filter=reserve_price%5B500%2C%5D&bcoffset=0&'
               'p4ppushleft=%2C44&s=' + str(num))
        web = requests.get(url, headers=headers)
        web.encoding = 'utf-8'
        return web
    # multithreading
    def multithreading():
        number = listno        # on each pass, request only the pages not yet scraped successfully
        event = []

        with ThreadPoolExecutor(max_workers=10) as executor:
            for result in executor.map(network_programming,
                                       number, chunksize=10):
                event.append(result)
        return event

    # disguise the client: set the headers parameter
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/70.0.3538.102 Safari/537.36'}

    listpg = []
    event = multithreading()
    for i in event:
        json = re.findall('"auctions":(.*?),"recommendAuctions"', i.text)
        if len(json):
            table = pd.read_json(json[0])
            datatmsp = pd.concat([datatmsp, table], axis=0, ignore_index=True)

            pg = re.findall('"pageNum":(.*?),"p4pbottom_up"', i.text)[0]
            listpg.append(pg)      # record each successfully scraped page number

    lists = []
    for a in listpg:
        b = 44*(int(a)-1)
        lists.append(b)     # convert each scraped page number back to the url num value

    listn = listno
    listno = []       # pages that failed on this pass, to be retried on the next one
    for p in listn:
        if p not in lists:
            listno.append(p)

    if len(listno) == 0:     # stop looping once no pages remain unscraped
        break

datatmsp.to_excel('datatmsp.xls', index=False)    # export the data to Excel
end = time.perf_counter()    # timer: end
print("Scraping finished, elapsed:", end - start, 's')
'''
II. Data cleaning and processing (this step can also be done in Excel, with the result read back in)
'''
datatmsp = pd.read_excel('datatmsp.xls')     # read the scraped data
#datatmsp.shape

# Missing-value analysis: this module is for reference only  # see <Figure 1> below
# Install the module: pip install missingno
import missingno as msno
msno.bar(datatmsp.sample(len(datatmsp)), figsize=(10, 4))
# Drop columns where more than half the values are missing: for reference only, not needed in this example
half_count = len(datatmsp)/2
datatmsp = datatmsp.dropna(thresh=half_count, axis=1)
# Drop duplicate rows:
datatmsp = datatmsp.drop_duplicates()
'''
Note: for this case I keep only the four columns item_loc, raw_title, view_price, and view_sales,
mainly to analyze title, region, price, and sales. The code is as follows:
'''
# Extract these 4 columns:
data = datatmsp[['item_loc'
  Attribute      Size        Date  Time  Name
----------- ---------  ---------- -----  ----
       File   4261888  2018-12-24 21:51  淘寶2\datatmsp.xls
       File     53070  2018-12-24 21:51  淘寶2\Price.xlsx
       File    147454  2018-12-24 19:40  淘寶2\Price_max.xlsx
       File    147176  2018-12-24 21:51  淘寶2\Price_mean.xlsx
       File    147442  2018-12-24 19:40  淘寶2\Price_min.xlsx
       File    118093  2018-11-20 20:55  淘寶2\stopwords.xlsx
       File     66325  2018-12-24 21:51  淘寶2\word_count.xlsx
       File     13078  2018-12-24 19:44  淘寶2\淘寶2.py
  Directory         0  2018-12-24 21:51  淘寶2
----------- ---------  ---------- -----  ----
              4954526                    9
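
The scraping stage of the script above keeps a list of page offsets that have not yet returned data and requests only those again on each pass. The following is a minimal offline sketch of that bookkeeping, not the author's code: `page_offsets`, `make_flaky_fetch`, `scrape_all`, and the `max_rounds` safety cap are hypothetical names introduced here for illustration, and the fetch function simulates failures instead of contacting Taobao.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 44  # listings per results page, matching the 44*(i-1) offset in the script


def page_offsets(first_page, last_page):
    """Return the s= offset for every page number in [first_page, last_page]."""
    return [PAGE_SIZE * (page - 1) for page in range(first_page, last_page + 1)]


def make_flaky_fetch(fail_first):
    """Build a hypothetical fetch that fails once for each offset in fail_first."""
    remaining = set(fail_first)
    lock = threading.Lock()

    def fetch(offset):
        with lock:
            if offset in remaining:
                remaining.discard(offset)
                return None                  # simulated response with no usable payload
        return offset // PAGE_SIZE + 1       # simulated page number parsed from the response

    return fetch


def scrape_all(offsets, fetch, max_rounds=10):
    """Request pending offsets in parallel and retry only those that failed.

    max_rounds is an assumption added for safety; the original script
    loops until every page has succeeded.
    """
    pending = list(offsets)
    done = []
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        with ThreadPoolExecutor(max_workers=10) as executor:
            results = list(executor.map(fetch, pending))
        # Convert each page number that came back into its offset again
        succeeded = {PAGE_SIZE * (page - 1) for page in results if page is not None}
        done.extend(sorted(succeeded))
        pending = [offset for offset in pending if offset not in succeeded]
    return done, pending, rounds


if __name__ == "__main__":
    offsets = page_offsets(1, 5)
    done, pending, rounds = scrape_all(offsets, make_flaky_fetch([44, 132]))
    print(done, pending, rounds)
```

Capping the number of passes is one possible design choice: if the site starts rejecting every request, the loop ends and reports the offsets that are still pending instead of running indefinitely.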