
  • Size: 7KB
    File type: .py
    Coins: 1
    Downloads: 0
    Release date: 2021-06-15
  • Language: Python
  • Tags:

Resource description

A multithreaded Python crawler that uses the threading and queue modules for thread synchronization.
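
The one-producer, one-consumer arrangement described here can be sketched roughly as follows. This is a minimal illustration, not the author's code: it uses the Python 3 module name `queue` (the listing targets Python 2, where the module is `Queue`), and the names `SENTINEL`, `producer`, `consumer`, and `run_pipeline` are hypothetical. The "work" is simulated with string handling rather than real downloads.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-work marker, not part of the original script


def producer(q, items):
    """Put every item on the queue, then signal that no more work is coming."""
    for item in items:
        q.put(item)  # blocks while the queue already holds maxsize items
    q.put(SENTINEL)


def consumer(q, results, lock):
    """Take items until the sentinel arrives, recording each one under a lock."""
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        with lock:  # protects the shared list, as a Lock would protect a shared counter
            results.append("saved " + item)


def run_pipeline(items, maxsize=12):
    """Run one producer and one consumer over a bounded queue."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results, lock)),
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results
```

The bounded queue provides the synchronization: `put` stalls the producer when the consumer falls behind, so at most `maxsize` pending items are held in memory at once.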

Code snippet and file information

"""
A web crawler for Tao Nv Lang. It grabs all MM's images into a hierarchical structure:
MM's name (folder) --> MM's albums (folder) --> MM's images (files)

Run the code in three steps:
1. Create an instance of the web crawler class.
2. Select a range of pages with a start index and an end index.
3. Call the class method save_all(start, end).

All images are saved under your current working directory.

Please note that names contain Chinese characters, which may produce unreadable folder names
if your computer does not support the Chinese language.

Author: Alien_gmx
Date: 10/26/2015
Version: 2.0
Version description:
Version 2.0 implements a multithreaded crawler.

1. Modify the inner functions get_album_contents and get_images_urls.
2. Implement one producer and one consumer.
3. Set the max queue size = 12.
"""
import urllib
import urllib2
import re
import os
from threading import Thread, Lock, Condition
from Queue import Queue


# Shared module-level state. A "global" statement has no effect at module
# scope, so plain assignments are enough here.
images_index = 0
images_q = Queue(maxsize=12)
threadLock = Lock()
# condition = Condition()
root_path = os.getcwd()

class crawler_producer(Thread):
    def __init__(self, start_index, end_index):
        Thread.__init__(self)
        self.mainurl = 'http://mm.taobao.com/json/request_top_list.htm'
        self.start_index = start_index
        self.end_index = end_index

    # Get page information by page index
    def get_page(self, index):
        url = self.mainurl + '?page=' + str(index)
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req)
        # The pages are encoded as GBK, a Chinese character encoding
        return resp.read().decode('gbk')

    # Get page information from a single URL
    def get_details(self, url):
        resp = urllib2.urlopen(url)
        return resp.read().decode('gbk')

    # Get main page contents: [0] URL of MM's page, [1] user id, [2] name, [3] age, [4] city
    def get_contents(self, index):
        page = self.get_page(index)
        # The literal HTML between the groups is missing from this listing;
        # the loop below expects five captured fields per match.
        pattern = re.compile('(.*?).*?(.*?).*?(.*?)
                             re.S)
        items = re.findall(pattern, page)
        content = []
        for i in items:
            content.append([i[0], i[1], i[2], i[3], i[4]])
        return content

    def new_folder(self, path, name):
        os.chdir(path)
        name = name.strip()
        name = name.strip('.')  # strip leading and trailing dots from the folder name
        is_exists = os.path.exists(name)
        if not is_exists:
            print 'Creating a new folder:', name
            print path
            os.makedirs(name)
        else:
            print 'The folder named', name, 'exists'

        return_path = path + '/' + name
        return return_path

    def get_images_urls(self, user_id, album_id, path_album):
        ma

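The new_folder method in the listing calls os.chdir, which changes the working directory for the whole process, so threads that rely on relative paths can interfere with one another. One possible alternative, offered only as a sketch and not as the author's approach, is to build the full path and never change directory. It is written for Python 3 (`exist_ok` is not available in Python 2), and the standalone function signature is an assumption made for illustration.

```python
import os


def new_folder(path, name):
    """Create path/name if it does not exist and return the full path.

    Unlike the listing's method, this never calls os.chdir, so it does not
    alter process-wide state shared by other threads.
    """
    # Same clean-up as the listing: trim whitespace, then surrounding dots
    name = name.strip().strip('.')
    full_path = os.path.join(path, name)
    # exist_ok=True makes the call safe when the folder is already present
    os.makedirs(full_path, exist_ok=True)
    return full_path
```

Calling it twice with the same arguments returns the same path and leaves the existing folder untouched.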