Multithreaded Crawler

Size: 7KB

File type: .py

Published: 2021-06-15
Language: Python

Resource description

A multithreaded Python crawler that uses the threading and queue modules for thread synchronization.
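
The synchronization scheme described above (worker threads coordinated through a bounded queue) can be sketched roughly as follows. This is a minimal illustration, not the resource's actual code. It assumes Python 3, where the module is named `queue`, and the names `producer`, `consumer`, `run_pipeline` and `SENTINEL` are hypothetical. The default queue size of 12 mirrors the value mentioned in the file's docstring, and `item.upper()` merely stands in for the real download work.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-work marker, not from the original file


def producer(items, q):
    """Put each item on the queue; put() blocks while the queue is full."""
    for item in items:
        q.put(item)
    q.put(SENTINEL)  # tell the consumer that no more work is coming


def consumer(q, results, lock):
    """Take items until the sentinel arrives; the lock guards the shared list."""
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        with lock:
            results.append(item.upper())  # placeholder for downloading an image


def run_pipeline(items, maxsize=12):
    """Run one producer and one consumer thread over a bounded queue."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=producer, args=(items, q)),
        threading.Thread(target=consumer, args=(q, results, lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a.jpg", "b.jpg", "c.jpg"]))
```

Because the queue is bounded, the producer cannot run arbitrarily far ahead of the consumer, which keeps memory use predictable when downloads are slower than page parsing.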

Code snippet and file information

"""
A web crawler for tao nv lang. Grabs all MM's images in a hierarchical structure:
MM's name (folder) --> MM's albums (folder) --> MM's images (files)

Run the code in three steps:
1. Create an instance of the web crawler class.
2. Select a range of pages with a start index and an end index.
3. Call the class method save_all(start, end).

All images are saved under your current path.

Please note that the names include Chinese characters. Folder names may be unreadable
if your computer does not support the Chinese language.

Author:	Alien_gmx
Date:	10/26/2015
Version: 2.0
Version description:
Version 2.0 implements a multithreaded crawler.

1. Modify the inner functions get_album_contents and get_images_urls.
2. Implement one producer and one consumer.
3. Set the max queue size = 12.
"""
import urllib
import urllib2
import re
import os
from threading import Thread, Lock, Condition
from Queue import Queue


# Module-level state shared by the producer and consumer threads.
# (The original "global" statements are dropped: they have no effect at module level.)
images_index = 0
images_q = Queue(maxsize=12)
threadLock = Lock()
# condition = Condition()
root_path = os.getcwd()

class crawler_producer(Thread):
    def __init__(self, start_index, end_index):
        Thread.__init__(self)
        self.mainurl = 'http://mm.taobao.com/json/request_top_list.htm'
        self.start_index = start_index
        self.end_index = end_index

    # Get page information by page index
    def get_page(self, index):
        url = self.mainurl + '?page=' + str(index)
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req)
        # The pages are GBK-encoded (a Chinese character encoding)
        return resp.read().decode('gbk')

    # Get page information from a single URL
    def get_details(self, url):
        resp = urllib2.urlopen(url)
        return resp.read().decode('gbk')

    # Get main page contents: [0] URL of MM's page, [1] user id, [2] name, [3] age, [4] city
    def get_contents(self, index):
        page = self.get_page(index)
        # The HTML tags that originally delimited these groups are missing from this listing;
        # the loop below expects five groups per match.
        pattern = re.compile('(.*?).*?(.*?).*?(.*?)',
                             re.S)
        items = re.findall(pattern, page)
        content = []
        for i in items:
            content.append([i[0], i[1], i[2], i[3], i[4]])
        return content

    def new_folder(self, path, name):
        os.chdir(path)
        name = name.strip()
        name = name.strip('.')  # remove leading and trailing dots from the folder name
        is_exists = os.path.exists(name)
        if not is_exists:
            print 'Creating a new folder:', name
            print path
            os.makedirs(name)
        else:
            print 'The folder named', name, 'exists'

        return_path = path + '/' + name
        return return_path

    def get_images_urls(self, user_id, album_id, path_album):
        ma
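
The `new_folder` method in the listing switches the working directory with `os.chdir`, which is a process-wide setting and therefore visible to every thread. One possible alternative, sketched here under assumptions, builds the target path explicitly instead. The sketch targets Python 3, and the function name `make_folder` is hypothetical and not part of the original file.

```python
import os
import tempfile


def make_folder(parent, name):
    """Create parent/name if needed and return its path (hypothetical helper)."""
    # Same clean-up as the listing: trim whitespace, then leading/trailing dots.
    clean = name.strip().strip('.')
    target = os.path.join(parent, clean)
    # exist_ok=True makes the call safe when the folder already exists.
    os.makedirs(target, exist_ok=True)
    return target


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # throwaway directory for the demonstration
    print(make_folder(base, " album. "))
```

Since nothing here depends on the current directory, several threads can create folders at the same time without interfering with each other's relative paths.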
