Resource Overview
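While building web crawlers I kept running into efficiency problems. After reading up on the topic from several sources, I wrote this project based on what I had learned. The demo is built entirely on Python's coroutine approach (generator-based `yield`/`send`), and it can be used for your own study or adapted into your own project. If you need it, feel free to download and use it.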
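
To make the coroutine idea mentioned above concrete, here is a minimal, network-free sketch of the generator `yield`/`send` handshake. The names `echo_consumer` and `drive` are hypothetical and not part of the project; the consumer simply upper-cases whatever it receives, as a stand-in for fetching a page.

```python
def echo_consumer():
    """Generator-based coroutine: receives an item via send(), yields a result."""
    result = ''
    while True:
        item = yield result          # pause here until the producer sends an item
        if item:
            result = item.upper()    # stand-in for real work such as an HTTP fetch


def drive(items):
    """Producer: primes the coroutine, then exchanges one item per send()."""
    c = echo_consumer()
    next(c)                          # advance to the first yield so send() is legal
    out = [c.send(item) for item in items]
    c.close()                        # raises GeneratorExit inside the coroutine
    return out


if __name__ == '__main__':
    print(drive(['a', 'b']))
```

Each `send()` hands control to the consumer and blocks until the consumer reaches its next `yield`, so producer and consumer take turns in a single thread.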

Code snippet and file information
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

start_url = 'https://www.cnblogs.com'
trust_host = 'www.cnblogs.com'
ignore_path = []    # substrings; any URL containing one is skipped
history_urls = []   # URLs that have already been scheduled


def parse_html(html):
    """Print the page title and return a generator of all href values."""
    soup = BeautifulSoup(html, "lxml")
    print(soup.title)
    links = soup.find_all('a', href=True)
    return (a['href'] for a in links if a['href'])


def parse_url(url):
    """Normalize a link; return None if it should not be crawled."""
    url = url.strip()
    # Drop the fragment part.
    if url.find('#') >= 0:
        url = url.split('#')[0]
    if not url:
        return None
    if url.find('javascript:') >= 0:
        return None
    for f in ignore_path:
        if f in url:
            return None
    # No 'http' in the link: treat it as relative to the start URL.
    if url.find('http') < 0:
        url = start_url + url
        return url
    # Absolute link: keep it only if it stays on the trusted host.
    parse = urlparse(url)
    if parse.hostname == trust_host:
        return url


def consumer():
    """Coroutine: receives a URL via send() and yields the fetched HTML."""
    html = ''
    while True:
        url = yield html
        if url:
            print('[CONSUMER] Consuming %s...' % url)
            rsp = requests.get(url)
            html = rsp.content


def produce(c):
    """Drive the consumer coroutine, starting from start_url."""
    next(c)  # prime the coroutine so it is paused at its first yield

    def do_work(urls):
        # Depth-first and recursive: each new page is crawled before its siblings.
        for u in urls:
            if u not in history_urls:
                history_urls.append(u)
                print('[PRODUCER] Producing %s...' % u)
                html = c.send(u)
                results = parse_html(html)
                work_urls = (x for x in map(parse_url, results) if x)
                do_work(work_urls)

    do_work([start_url])
    c.close()


if __name__ == '__main__':
    c = consumer()
    produce(c)
    print(len(history_urls))
  Attribute      Size        Date  Time  Name
----------- ---------  ---------- -----  ----
       File      1588  2019-10-21 14:10  python_spider.py
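
The `parse_url` function in the listing resolves relative links by concatenating them onto `start_url`, which works only when the link begins with `/`. As a hedged alternative, the sketch below uses the standard library's `urljoin` and `urldefrag` for the same job. The names `normalize_link`, `base_url`, and `allowed_host` are hypothetical and do not appear in the project.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(href, base_url, allowed_host):
    """Return an absolute, fragment-free URL on allowed_host, or None."""
    href = href.strip()
    if not href or href.lower().startswith('javascript:'):
        return None
    # Resolve relative links against the page they were found on.
    absolute = urljoin(base_url, href)
    # Strip the '#fragment' part so the same page is not visited twice.
    absolute, _fragment = urldefrag(absolute)
    parts = urlparse(absolute)
    if parts.scheme not in ('http', 'https'):
        return None          # e.g. mailto: links
    if parts.hostname != allowed_host:
        return None          # stay on the trusted host
    return absolute


if __name__ == '__main__':
    print(normalize_link('/news#top', 'https://www.cnblogs.com', 'www.cnblogs.com'))
```

Passing the URL of the page being parsed as `base_url`, rather than the site root, lets links such as `pick/` resolve relative to the current directory.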