Resource overview
A Python crawler that scrapes web news data into a SQL Server database and excludes duplicate entries by title. Runs on Python 3.7.
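The preview below is cut off before the database code, so the title-based de-duplication is not visible. A minimal sketch of one way it could work follows. The table name `news`, its columns `title` and `url`, and the helper `insert_if_new` are all hypothetical and not taken from the original resource. `sqlite3` stands in for `pyodbc` here so the sketch runs without a SQL Server instance; both use `?` placeholders for parameters.

```python
import sqlite3  # stand-in for pyodbc so the sketch runs without SQL Server


def insert_if_new(cursor, title, url):
    """Insert a news row unless a row with the same title already exists.

    Hypothetical helper: the table `news` and its columns are assumptions.
    Returns True if a row was inserted, False if the title was a duplicate.
    """
    cursor.execute("SELECT 1 FROM news WHERE title = ?", (title,))
    if cursor.fetchone() is not None:
        return False
    cursor.execute("INSERT INTO news (title, url) VALUES (?, ?)", (title, url))
    return True


def demo():
    """Run the helper against an in-memory database and report the outcome."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE news (title TEXT, url TEXT)")
    results = [
        insert_if_new(cur, "Fund A", "http://example.com/1"),
        insert_if_new(cur, "Fund A", "http://example.com/2"),  # same title
        insert_if_new(cur, "Fund B", "http://example.com/3"),
    ]
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM news")
    count = cur.fetchone()[0]
    conn.close()
    return results, count
```

With a real `pyodbc` connection the same helper could be passed `conn.cursor()` unchanged. A unique constraint on the title column would be a sturdier alternative if several crawlers write at once.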
Code snippet and file information
#coding=utf-8
'''
Created on 2018-10-31
@author: lhm
Test code
'''
import random
import time
import requests
import re
from bs4 import BeautifulSoup
import pyodbc
def getHTMLText(url):
    # Fetch a page and return its text, or an empty string on any request error.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        #r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""
'''
getNewsPakge()
This function builds the URLs of the News list pages.
Returns the list pakge_urls.
'''
def getNewsPakge():
    pakge_urls = []
    for i in range(112):
        # Page 1 has no numeric suffix; every other index maps to cjjyw_<i>.html.
        # Note that range(112) starts at 0, so cjjyw_0.html is also generated.
        if i != 1:
            url = 'http://fund.eastmoney.com/a/cjjyw_' + str(i) + '.html'
        else:
            url = 'http://fund.eastmoney.com/a/cjjyw.html'
        print(url)
        pakge_urls.append(url)
    return pakge_urls

'''
getNewsUrls()
This function collects the News links, used later for accessing the information ur