Resource overview
A quick test: crawling a single page works fine, but I have not yet found where batch crawling is set up...
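Web crawler source code
This is a web crawler written in C#.
Main features:
Configurable: thread count, thread wait time, connection timeout, crawlable file types and their priorities, download directory, and so on.
Status bar statistics: number of URLs queued, number of files downloaded, total bytes downloaded, CPU usage, available memory, and so on.
Preferential crawling: a different priority can be set for each type of resource being crawled.
Robustness: a dozen or so URL normalization strategies to rule out redundant downloads, strategies for avoiding crawler traps, and several strategies for resolving relative paths.
Good performance: regular-expression-based page parsing, moderate locking, persistent HTTP connections, and so on.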
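The URL normalization and relative-path handling mentioned above can be pictured with a minimal sketch. It is not the code from NWebCrawlerLib; the class and method names (`UrlNormalizer`, `Normalize`) are hypothetical, and it covers only a few common strategies: resolving a relative link against its page, lower-casing the host, dropping the default port and the fragment, and sorting query parameters. Sorting the query is an assumption that works for most sites but can change meaning on servers that depend on parameter order.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of NWebCrawlerLib.
public static class UrlNormalizer
{
    // Resolves `link` against `pageUrl` and returns a canonical string,
    // or null when the link cannot be parsed or is not http/https.
    public static string Normalize(string pageUrl, string link)
    {
        Uri pageUri;
        Uri absolute;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out pageUri)) return null;
        if (!Uri.TryCreate(pageUri, link, out absolute)) return null;
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        // System.Uri already lower-cases scheme and host, removes the default
        // port and collapses "." and ".." segments. GetLeftPart(Path) also
        // leaves out the query and the fragment.
        string result = absolute.GetLeftPart(UriPartial.Path);

        // Sort query parameters so that "?b=2&a=1" and "?a=1&b=2" compare equal.
        string query = absolute.Query.TrimStart('?');
        string[] parts = query.Split('&')
                              .Where(p => p.Length > 0)
                              .OrderBy(p => p, StringComparer.Ordinal)
                              .ToArray();
        if (parts.Length > 0)
            result += "?" + string.Join("&", parts);
        return result;
    }
}
```

A crawler would typically keep the returned strings in a `HashSet<string>` and skip any link whose normalized form is already in the set.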
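Features that may be added later, time permitting:
New feature: Description
Store crawled files in Berkeley DB: Better performance, since common operating systems are not good at handling large numbers of small files
Priority queue based on URL ranking: Focused (topic) crawling, where a machine learning algorithm scores how relevant each link is to the topic and links are crawled in the resulting priority order
Crawler etiquette: Obey the robots exclusion protocol and avoid overusing server resources
Performance optimization: Replace the packaged HttpWebRequest/Response with UDP
    DNS caching
    Asynchronous DNS resolution
    Disk cache or in-memory database to avoid frequent disk seeks
    Distributed crawling to extend the capacity of a single machine (CPU, memory and disk access)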
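To make the "crawler etiquette" item concrete, here is a rough sketch of checking a path against robots.txt. It is an illustration under simplifying assumptions, not a planned NWebCrawler API: the `RobotsRules` class is hypothetical, only `Disallow` lines in groups addressed to `User-agent: *` are honored, and `Allow`, wildcards and `Crawl-delay` are ignored.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified robots.txt checker for illustration only.
public sealed class RobotsRules
{
    private readonly List<string> _disallowed = new List<string>();

    // Parses robots.txt text and keeps the Disallow prefixes of "*" groups.
    public static RobotsRules Parse(string robotsTxt)
    {
        var rules = new RobotsRules();
        bool groupApplies = false;      // does the current group target "*"?
        bool lastWasUserAgent = false;  // consecutive User-agent lines share one group
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);  // strip comments
            line = line.Trim();
            int colon = line.IndexOf(':');
            if (colon <= 0) continue;                        // blank or malformed line
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();
            if (field == "user-agent")
            {
                if (!lastWasUserAgent) groupApplies = false; // a new group starts
                if (value == "*") groupApplies = true;
                lastWasUserAgent = true;
            }
            else
            {
                lastWasUserAgent = false;
                if (field == "disallow" && groupApplies && value.Length > 0)
                    rules._disallowed.Add(value);
            }
        }
        return rules;
    }

    // True when no recorded Disallow prefix matches the start of the path.
    public bool IsAllowed(string path)
    {
        foreach (string prefix in _disallowed)
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        return true;
    }
}
```

In use, the crawler would download `/robots.txt` once per host, cache the parsed `RobotsRules`, and call `IsAllowed` before queuing each URL for that host.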
Code snippet and file information
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using NWebCrawlerLib;
using System.Diagnostics;
// Source download: www.51aspx.com
namespace NWebCrawler
{
    public partial class MainForm : Form
    {
        #region Fields
        private PerformanceCounter m_cpuCounter;
        private PerformanceCounter m_ramCounter;
        private Downloader m_downloader;
        #endregion
        #region Properties
        // number of bytes downloaded
        private int nByteCount;
        private int ByteCount
        {
            get { return nByteCount; }
            set
            {
                nByteCount = value;
Attribute       Size  Date        Time   Name
---------  ---------  ----------  -----  ----
File             162  2010-01-05  10:27  win_NWebCrawler\bin\config.ini
File           36654  2010-01-05  10:40  win_NWebCrawler\bin\download\0003be8238c8302e17c799d9f5d65876.gif
File           73958  2010-01-05  10:40  win_NWebCrawler\bin\download\0718ad68487fa12de0cc75b20f7be03c.html; charset=utf-8
File           48666  2010-01-05  10:40  win_NWebCrawler\bin\download\082e9d970f371da4f6e74dbe2c97f6e2.html; charset=utf-8
File             317  2010-01-05  10:41  win_NWebCrawler\bin\download\132949602460dfebc35da092329cba0c.gif
File            4334  2010-01-05  10:47  win_NWebCrawler\bin\download\1695505243ceaa9c68e5a00061d1763f.ja
File           15297  2010-01-05  10:40  win_NWebCrawler\bin\download\1df7133090a0d07c5cec8fccbf6fd8dd.html; charset=utf-8
File             164  2010-01-05  10:40  win_NWebCrawler\bin\download\203557adfb69f0b4da4e237df2c0899a.html; charset=gb2312
File           14650  2010-01-05  10:40  win_NWebCrawler\bin\download\23e5f50b0b42662c6694e574e74835cd.html; charset=utf-8
File           63579  2010-01-05  10:41  win_NWebCrawler\bin\download\24eebf7019dc355f064372d6a889c60a.html; charset=gb2312
File           54471  2010-01-05  10:41  win_NWebCrawler\bin\download\27439efce81b9ca84182d54aa411418e.html; charset=gb2312
File           36711  2010-01-05  10:40  win_NWebCrawler\bin\download\2a2f02ca86459cde185fc8e8e9045bed.html; charset=utf-8
File             287  2010-01-05  10:40  win_NWebCrawler\bin\download\349427e49e96cbca35651e55ef94353d.gif
File          108468  2010-01-05  10:40  win_NWebCrawler\bin\download\3891570720e771c847e5ac23e28aa6cc.html
File             322  2010-01-05  10:41  win_NWebCrawler\bin\download\3ff2932f670fc24203b1290df195dabf.gif
File              10  2010-01-05  10:46  win_NWebCrawler\bin\download\417d9e708c95da24b75705338598087f.html
File           47067  2010-01-05  10:41  win_NWebCrawler\bin\download\44b19dec343bee7540d2e563399518f6.html; charset=gb2312
File           22221  2010-01-05  10:40  win_NWebCrawler\bin\download\46e1c646c9965ce2581be0e2baa182cf.html; charset=utf-8
File            4962  2010-01-05  10:46  win_NWebCrawler\bin\download\48bfe5c4818bc6d7d0a86b7c5d5a963a.ja
File           11484  2010-01-05  10:46  win_NWebCrawler\bin\download\4cef95f512517e118d0427cdf40d8d91.ja
File           48471  2010-01-05  10:40  win_NWebCrawler\bin\download\54cd270476c08dc49137cc587d5420e7.html; charset=utf-8
File             305  2010-01-05  10:40  win_NWebCrawler\bin\download\5ae7c8b442091b3c740b5f89f2202977.gif
File           46870  2010-01-05  10:41  win_NWebCrawler\bin\download\5f194c03340af2c82af0806b4cd95f44.html; charset=gb2312
File           39917  2010-01-05  10:46  win_NWebCrawler\bin\download\6a78a05748d064e4491b674a391174c7.ja
File           74477  2010-01-05  10:40  win_NWebCrawler\bin\download\6ba086f85f3602a364dae60f740138c5.html; charset=gb2312
File           93739  2010-01-05  10:29  win_NWebCrawler\bin\download\73e9259e079ac68519bd2cf67af06c13.html; charset=utf-8
File            1570  2010-01-05  10:46  win_NWebCrawler\bin\download\753a67d9417f20f83e1dce17d6146f85.gif
File            3440  2010-01-05  10:40  win_NWebCrawler\bin\download\767223508f1bd57304d84720065f9ee8.x-ja
File          103862  2010-01-05  10:41  win_NWebCrawler\bin\download\7780c2d0134fad8b7a05a95d0f7b3378.html; charset=gb2312
File             205  2010-01-05  10:47  win_NWebCrawler\bin\download\7a6721fd05029de13a9df0e2a0948f25.html; charset=UTF-8
............ (250 more file entries omitted here)