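I want to use Nutch to crawl web pages. I followed the documentation on the official Nutch site step by step (the crawl ran successfully, and I copied schema-solr4.xml into the Solr directory). However, when I run the Nutch 1.3 and Solr 4.4.0 integration, the indexing job fails: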
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
I get the following error:
Indexer: starting at 2013-08-25 09:17:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
I should mention that Solr is running, but I cannot browse to http://localhost:8983/solr/admin (it redirects me to http://localhost:8983/solr/#).
On the other hand, when I stop Solr, I get the same error! Does anyone know what is wrong with my setup?
P.S. The URL I am crawling is: http://localhost/NORC
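Because the error is the same whether Solr is running or stopped, one thing that may be worth ruling out is whether anything answers at the URL passed to solrindex at all. The following is only a minimal sketch using the Python 3 standard library. The function name solr_is_reachable is hypothetical, not part of Nutch or Solr. It only tests HTTP reachability and says nothing about why the indexing job itself failed.

```python
import http.client
import urllib.error
import urllib.request


def solr_is_reachable(base_url, timeout=5.0):
    """Return True if any HTTP response comes back from base_url.

    Hypothetical helper for illustration: it only checks that a server
    answers at the given URL, not that Solr is configured correctly.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A server answered, but with an error status (e.g. 404 or 500).
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False
```

For instance, solr_is_reachable("http://localhost:8983/solr/") would be expected to return False once Solr is stopped. If it returns True while Solr is up and the job still fails, the cause probably lies somewhere other than plain connectivity.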
Were you able to solve this problem? – Monodeep