I have a list of URLs that I want to test scraping with Nutch — only those specific URLs, with no crawling beyond the list. I am not familiar with Nutch's filtering and normalization.
I referred to this post for disabling crawling.
I found that my 5 test URLs become 0 after normalization and filtering:
$:~/apache-nutch-1.7$ bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 1000
Injector: starting at 2013-12-18 23:07:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 5
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-12-18 23:07:39, elapsed: 00:00:06
Generator: starting at 2013-12-18 23:07:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
In fact, I left the filters and normalization at their defaults, so my guess was that nothing would be filtered out.
Can anyone help me understand what is going on here?
Injector: total number of urls rejected by filters: 5
Can anyone tell me which configuration file I should change to stop the rejection above?
My test URLs — the ones that the "filters" line above refers to — look like this:
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=0
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=25
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=50
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=75
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=100
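
For context (this is an assumption based on the stock configuration shipped with Nutch 1.x, not something I have confirmed against this setup): the default `conf/regex-urlfilter.txt` contains a rule that rejects any URL containing query-string characters, which would match every one of the seed URLs above, since they all contain `?` and `=`:

```
# From the stock conf/regex-urlfilter.txt (Nutch 1.x defaults):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
```

If this rule is the culprit, commenting out the `-[?*!@=]` line (or replacing it with a narrower rule) would let URLs with query strings pass the injector's filtering step.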