$ ./nutch crawl urls -solr http://localhost:8080/solr/ -depth 2 -topN 3
cygpath: can't convert empty path 
crawl started in: crawl-20140115213017 
rootUrlDir = urls 
threads = 10 
depth = 2 
solrUrl=http://localhost:8080/solr/
topN = 3 
Injector: starting at 2014-01-15 21:30:17 
Injector: crawlDb: crawl-20140115213017/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. 
Injector: Merging injected urls into crawl db. 
Injector: finished at 2014-01-15 21:30:21, elapsed: 00:00:03 
Generator: starting at 2014-01-15 21:30:21 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 3 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: Partitioning selected urls for politeness. 
Generator: segment: crawl-20140115213017/segments/20140115213024 
Generator: finished at 2014-01-15 21:30:26, elapsed: 00:00:04 
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2014-01-15 21:30:26 
Fetcher: segment: crawl-20140115213017/segments/20140115213024 
Using queue mode : byHost 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 1 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold retries: 5 
fetching http://www.parkinson.org/
-finishing thread FetcherThread, activeThreads=3 
-finishing thread FetcherThread, activeThreads=2 
-finishing thread FetcherThread, activeThreads=7 
-finishing thread FetcherThread, activeThreads=6 
-finishing thread FetcherThread, activeThreads=5 
-finishing thread FetcherThread, activeThreads=4 
-finishing thread FetcherThread, activeThreads=3 
-finishing thread FetcherThread, activeThreads=2 
-finishing thread FetcherThread, activeThreads=1 
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=0 
Fetcher: finished at 2014-01-15 21:30:32, elapsed: 00:00:06 
ParseSegment: starting at 2014-01-15 21:30:32 
ParseSegment: segment: crawl-20140115213017/segments/20140115213024 
Parsing: http://www.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:34, elapsed: 00:00:01 
CrawlDb update: starting at 2014-01-15 21:30:34 
CrawlDb update: db: crawl-20140115213017/crawldb 
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213024] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2014-01-15 21:30:36, elapsed: 00:00:01 
Generator: starting at 2014-01-15 21:30:36 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 3 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: Partitioning selected urls for politeness. 
Generator: segment: crawl-20140115213017/segments/20140115213038 
Generator: finished at 2014-01-15 21:30:39, elapsed: 00:00:03 
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2014-01-15 21:30:39 
Fetcher: segment: crawl-20140115213017/segments/20140115213038 
Using queue mode : byHost 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 3 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
fetching http://forum.parkinson.org/
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold retries: 5 
fetching http://twitter.com/ParkinsonDotOrg
fetching http://www.youtube.com/user/NPFGuru
-finishing thread FetcherThread, activeThreads=9 
-finishing thread FetcherThread, activeThreads=8 
-finishing thread FetcherThread, activeThreads=7 
-finishing thread FetcherThread, activeThreads=6 
-finishing thread FetcherThread, activeThreads=5 
-finishing thread FetcherThread, activeThreads=4 
-finishing thread FetcherThread, activeThreads=3 
-finishing thread FetcherThread, activeThreads=2 
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0 
-finishing thread FetcherThread, activeThreads=1 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=0 
Fetcher: finished at 2014-01-15 21:30:44, elapsed: 00:00:04 
ParseSegment: starting at 2014-01-15 21:30:44 
ParseSegment: segment: crawl-20140115213017/segments/20140115213038 
Parsing: http://forum.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:45, elapsed: 00:00:01 
CrawlDb update: starting at 2014-01-15 21:30:45 
CrawlDb update: db: crawl-20140115213017/crawldb 
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213038] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2014-01-15 21:30:46, elapsed: 00:00:01 
LinkDb: starting at 2014-01-15 21:30:46 
LinkDb: linkdb: crawl-20140115213017/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213024 
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213038 
LinkDb: finished at 2014-01-15 21:30:47, elapsed: 00:00:01 
SolrIndexer: starting at 2014-01-15 21:30:47 
Adding 2 documents 
java.io.IOException: Job failed! 
SolrDeleteDuplicates: starting at 2014-01-15 21:30:52 
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
SolrDeleteDuplicates: finished at 2014-01-15 21:30:53, elapsed: 00:00:01 
crawl finished: crawl-20140115213017 

Error: "Adding 2 documents, java.io.IOException: Job failed!" (Solr 3.4, Nutch 1.4 bin on Windows with Cygwin)

I'm new to Apache and need some help. I'm trying to send the crawled data to Solr, but I'm getting the error "java.io.IOException: Job failed!" shown in the output above.


I think this is related to your Solr configuration. Take a look at your Solr log (and if an error shows up there, post it here). Also check your Nutch logs (in the nutch/logs directory). – tahagh
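
As the comment suggests, the console output hides the underlying Solr exception; it usually lands in Nutch's Hadoop log instead. A minimal sketch for digging it out, assuming the default Nutch 1.4 local-runtime layout under Cygwin (adjust the path to your install):

# The full stack trace behind "Job failed!" is normally written here.
$ tail -n 100 /home/nutch/runtime/local/logs/hadoop.log

# Narrow the output down to Solr-related errors.
$ grep -i -A 10 "solrindexer\|solrserverexception" /home/nutch/runtime/local/logs/hadoop.log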

Answers


Sounds like your Solr and Nutch schema files don't match. Take a look at this post; I was using Solr 4.3, but I don't think it should be too different:

http://amac4.blogspot.com/2013/07/configuring-nutch-to-crawl-urls.html

The log files have more detailed information about the problem, so you could post them here.
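
If it is a schema mismatch, the usual fix for Nutch 1.4 with Solr is to make Solr serve the schema.xml that ships with Nutch. A rough sketch, where NUTCH_HOME and SOLR_HOME are hypothetical variables standing in for your actual install paths (on Tomcat, SOLR_HOME is wherever solr/home points):

# Back up Solr's current schema, then replace it with Nutch's.
$ cp $SOLR_HOME/conf/schema.xml $SOLR_HOME/conf/schema.xml.bak
$ cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/conf/schema.xml

# Restart Solr (Tomcat here, given the http://localhost:8080/solr/ URL)
# and re-run the crawl so the solrindex step uses the matching schema.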


Still don't know what the problem was, but it's solved now. I just moved the Solr directory from cygwin/home/solr to C:/solr and that fixed it. Now, can anyone give me a link that would help me integrate Tika with Solr 3.4 and Nutch 1.4 bin, and tell me which versions are suitable or compatible? – user2682833


There is information on the same site about setting up Tika with Solr. It's for Solr 4.3 rather than 3.4, but most of it is the same. –
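
For what it's worth, Nutch 1.4 should already ship with the Tika parser (parse-tika) enabled in its default plugin list, so there may be no extra integration step needed on the Nutch side. A quick sanity check, assuming the stock conf layout:

# The plugin.includes value should contain parse-(html|tika).
$ grep -A 4 "plugin.includes" conf/nutch-default.xml

# To change the plugin set, copy the property into conf/nutch-site.xml
# and edit it there; nutch-site.xml overrides nutch-default.xml.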


Your command seems to be wrong. It should be:

$ ./nutch crawl urls -dir newCrawl -solr http://localhost:8080/solr/ -depth 3 -topN 5

Your mistake: you didn't include "-dir".
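
Whichever form of the command you use, you can verify that documents actually reached Solr with a quick query; a sketch, assuming the same Solr URL as in the question:

# numFound in the response should be non-zero after a successful
# solrindex step.
$ curl "http://localhost:8080/solr/select?q=*:*&rows=0"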
