  • Size: 800B
    File type: .zip
    Coins: 2
    Downloads: 0
    Release date: 2021-05-12
  • Language: Python
  • Tags: python, spider

Resource description

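While developing crawlers I kept running into efficiency problems. After reading through a range of material, I wrote this project from what I had learned. The demo is built entirely on Python's coroutine idea (a generator driven with send()), and it can be used for your own study or dropped into your own projects. If you need it, go ahead and download it.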
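The coroutine idea mentioned above can be illustrated without any network access. The following is only a minimal sketch: `echo_consumer` and `run_producer` are hypothetical names chosen for this illustration, and the string formatting stands in for real work such as an HTTP fetch.

```python
def echo_consumer():
    """Generator-based coroutine: receives a job via send(), yields a result."""
    result = ''
    while True:
        job = yield result            # pause here until the producer sends a job
        if job:
            result = 'done:%s' % job  # stand-in for real work (e.g. fetching a page)


def run_producer(c, jobs):
    """Drive the coroutine: prime it, send each job, collect what it yields."""
    next(c)                           # advance the generator to its first yield
    results = [c.send(job) for job in jobs]
    c.close()                         # stop the consumer's infinite loop
    return results


outputs = run_producer(echo_consumer(), ['a', 'b'])
print(outputs)
```

Control passes back and forth between the two functions inside a single thread, so each job is still handled one after another; the pattern organizes the hand-off between producer and consumer rather than running fetches in parallel.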

Code snippet and file information

from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

start_url = 'https://www.cnblogs.com'
trust_host = 'www.cnblogs.com'
ignore_path = []    # substrings; any URL containing one of them is skipped
history_urls = []   # URLs that have already been scheduled


def parse_html(html):
    """Print the page title and return a generator of non-empty hrefs."""
    soup = BeautifulSoup(html, "lxml")
    print(soup.title)
    links = soup.find_all('a', href=True)
    return (a['href'] for a in links if a['href'])


def parse_url(url):
    """Normalize a link; return None if it should not be crawled."""
    url = url.strip()

    # Drop the fragment, e.g. '/page#top' -> '/page'.
    if url.find('#') >= 0:
        url = url.split('#')[0]
    if not url:
        return None
    if url.find('javascript:') >= 0:
        return None

    for f in ignore_path:
        if f in url:
            return None
    # Anything without 'http' in it is treated as a site-relative path.
    if url.find('http') < 0:
        url = start_url + url
        return url
    # Absolute URL: keep it only if it stays on the trusted host.
    parse = urlparse(url)
    if parse.hostname == trust_host:
        return url
    return None


def consumer():
    """Coroutine: receives a URL through send() and yields the fetched HTML."""
    html = ''
    while True:
        url = yield html

        if url:
            print('[CONSUMER] Consuming %s...' % url)
            # A timeout keeps one stalled server from blocking the crawl forever.
            rsp = requests.get(url, timeout=10)
            html = rsp.content


def produce(c):
    next(c)  # prime the coroutine so it is paused at its first yield

    def do_work(urls):
        # Depth-first crawl; the recursion can hit Python's recursion limit
        # on sites with long link chains.
        for u in urls:
            if u not in history_urls:
                history_urls.append(u)
                print('[PRODUCER] Producing %s...' % u)
                html = c.send(u)
                results = parse_html(html)
                work_urls = (x for x in map(parse_url, results) if x)
                do_work(work_urls)

    do_work([start_url])
    c.close()


if __name__ == '__main__':
    c = consumer()
    produce(c)
    print(len(history_urls))
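The link handling in parse_url relies on substring checks and string concatenation, so a relative link without a leading slash (for example 'news/1') would be glued directly onto the host name, and any path that merely contains 'http' is treated as absolute. A possible alternative, shown here only as a sketch, is to let the standard library resolve links. `normalize_link`, `BASE_URL` and `TRUST_HOST` are hypothetical names introduced for this illustration; the two constants simply mirror start_url and trust_host from the listing.

```python
from urllib.parse import urldefrag, urljoin, urlparse

BASE_URL = 'https://www.cnblogs.com'   # assumed: same value as start_url
TRUST_HOST = 'www.cnblogs.com'         # assumed: same value as trust_host


def normalize_link(href, base=BASE_URL, host=TRUST_HOST):
    """Hypothetical helper: return an absolute on-site URL, or None to skip."""
    href = href.strip()
    if not href or href.lower().startswith('javascript:'):
        return None
    # urljoin resolves relative links against the base URL;
    # urldefrag splits off the '#fragment' part.
    absolute, _fragment = urldefrag(urljoin(base, href))
    parts = urlparse(absolute)
    # Skip non-HTTP schemes (mailto:, tel:, ...) and off-site hosts.
    if parts.scheme not in ('http', 'https') or parts.hostname != host:
        return None
    return absolute


print(normalize_link('/news/1#top'))
print(normalize_link('https://example.com/x'))
```

This sketch leaves out the ignore_path filter from the listing; it would need to be added back before swapping the helper in for parse_url.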

Attribute    Size       Date        Time   Name
-----------  ---------  ----------  -----  ----
File         1588       2019-10-21  14:10  python_spider.py
