2017-01-03

Nutch crawl not working

I want to crawl a site with Apache Nutch 1.12 and index the data into Apache Solr. I followed this tutorial.

My seed.txt file contains this URL: http://nutch.apache.org/

In my regex URL filter I have this: +^http://([a-z0-9]*\.)*nutch.apache.org/
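To check which URLs a rule like this would let through, here is a minimal Python sketch. It assumes the rule has the form used in the Nutch tutorial (`+^http://([a-z0-9]*\.)*nutch.apache.org/`), that rules are tried in order with the first match deciding, and that a URL matching no rule is rejected. The `filter_url` helper is hypothetical and only approximates how the regex URL filter behaves; it is not Nutch code.

```python
import re

# Assumed rule, in the form shown in the Nutch tutorial. The leading "+"
# means "accept" and is not part of the regular expression itself.
RULES = [r"+^http://([a-z0-9]*\.)*nutch.apache.org/"]


def filter_url(rules, url):
    """Return True if the URL would be accepted (hypothetical helper).

    Assumption: the first rule whose pattern is found in the URL decides
    ("+" accepts, "-" rejects); a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for candidate in (
        "http://nutch.apache.org/",
        "http://nutch.apache.org/downloads.html",
        "https://nutch.apache.org/",
        "http://www.apache.org/",
    ):
        print(candidate, filter_url(RULES, candidate))
```

Under these assumptions the rule accepts only plain http:// URLs on nutch.apache.org and its subdomains, so outlinks that use https:// or point to other hosts would be filtered out.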

When I try to fetch data, I only get the URL that is in my seed.txt file.

Fetcher: starting at 2017-01-03 09:56:23 
Fetcher: segment: crawl/segments/20170103095613 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 2 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
fetching http://nutch.apache.org/ (queue crawl delay=5000ms) 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold retries: 5 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
robots.txt whitelist not configured. 
robots.txt whitelist not configured. 
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2 
Thread FetcherThread has no more work available 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=1 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0 
-activeThreads=0 

What am I missing here?

Try running the cycle repeatedly: Generate > Fetch > Parse > Updatedb. Check your log entries for more details.
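The repeated cycle from the comment above could be scripted roughly as follows. This is only a sketch: it assumes it is run from the Nutch home directory (so `bin/nutch` resolves), that the crawl database is at `crawl/crawldb` (the segments path `crawl/segments` is taken from the log), and that the seed URLs have already been injected. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"            # assumption: run from the Nutch home directory
CRAWLDB = "crawl/crawldb"      # assumption: crawl database location
SEGMENTS = "crawl/segments"    # segments directory as seen in the fetcher log


def generate_command(crawldb=CRAWLDB, segments=SEGMENTS):
    """Command that creates a new segment with the URLs due for fetching."""
    return [NUTCH, "generate", crawldb, segments]


def segment_commands(segment, crawldb=CRAWLDB):
    """Fetch, parse and updatedb commands for one generated segment."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
    ]


def run_round():
    """Run one Generate > Fetch > Parse > Updatedb round."""
    subprocess.run(generate_command(), check=True)
    # Segment directories are named by timestamp, so the last one is the newest.
    newest = sorted(Path(SEGMENTS).iterdir())[-1]
    for command in segment_commands(str(newest)):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Each round can fetch the links discovered by the previous one.
    for _ in range(3):
        run_round()
```

The first round only fetches the seed URL; the outlinks found when parsing it are added to the crawl database by updatedb and become candidates for the next generate step.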

Answer

I ran the fetch again and got the expected result.