I have a list of URLs that I want to test scraping with Nutch — only those specific URLs, with no crawling beyond the list. I am not familiar with Nutch's filtering and normalization.
I referred to this post for disabling crawling.
I found that my 5 test URLs become 0 after normalization and filtering:
$:~/apache-nutch-1.7$ bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 1000
Injector: starting at 2013-12-18 23:07:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 5
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-12-18 23:07:39, elapsed: 00:00:06
Generator: starting at 2013-12-18 23:07:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
In fact, I left the filters and normalization at their defaults, so my guess was that nothing would be filtered out.
Can anyone help me understand what is going on here?
Injector: total number of urls rejected by filters: 5
Can anyone tell me which configuration file I should change to stop the rejection above?
My test URLs — the ones that the "filters" line above refers to — look like this:
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=0
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=25
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=50
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=75
http://example.com/store/em?action=products&cat=1&catalogId=500201&No=100
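
For context (this is an assumption based on the stock configuration shipped with Nutch 1.x, not something I have confirmed against this setup): the default `conf/regex-urlfilter.txt` contains a rule that rejects any URL containing query-string characters, which would match every one of the seed URLs above, since they all contain `?` and `=`:

```
# From the stock conf/regex-urlfilter.txt (Nutch 1.x defaults):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
```

If this rule is the culprit, commenting out the `-[?*!@=]` line (or replacing it with a narrower rule) would let URLs with query strings pass the injector's filtering step.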