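Nutch bin/crawl script failing - manual steps work fine
I am trying to run the "bin/crawl" script provided in Nutch 1.6, which performs all of the manual steps below to go off and spider a site.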
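When I perform these steps manually, everything works fine and my page is indexed as expected (albeit only one page, but I will look into that).
Created a text file containing a URL at seeds/urls.txt
bin/nutch inject crawl_test/crawldb seeds/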
bin/nutch generate crawl_test/crawldb crawl_test/segments
export SEGMENT=crawl_test/segments/`ls -tr crawl_test/segments|tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl_test/crawldb $SEGMENT -filter -normalize
bin/nutch invertlinks crawl_test/linkdb -dir crawl_test/segments
bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/*
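The SEGMENT step above picks the newest segment by modification time. As a rough sketch of the same step that keeps the full segment path explicit, the helper below selects by name instead. `latest_segment` is a hypothetical function, not part of Nutch, and it assumes segment directory names are timestamps such as 20130412115759, so that lexical order matches creation order.

```shell
#!/usr/bin/env bash
# Sketch only: latest_segment is a hypothetical helper, not a Nutch command.
# Assumption: segment directory names are timestamps (e.g. 20130412115759),
# so sorting the names lexically also sorts them chronologically.
latest_segment() {
  local segments_dir="$1"
  local newest
  # List entry names, sort them, and keep the last (newest) one.
  newest=$(ls -1 "$segments_dir" 2>/dev/null | sort | tail -n 1)
  if [ -z "$newest" ]; then
    echo "no segments found in $segments_dir" >&2
    return 1
  fi
  # Print the full path, not just the bare segment name.
  printf '%s/%s\n' "$segments_dir" "$newest"
}
```

It could replace the export line like this: `export SEGMENT=$(latest_segment crawl_test/segments)`.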
The bin/crawl script gives this error...
Indexing 20130412115759 on SOLR index -> someurl:8080/solr/
SolrIndexer: starting at 2013-04-12 11:58:47
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/20130412115759/crawl_fetch
Input path does not exist: file:/opt/nutch/20130412115759/crawl_parse
Input path does not exist: file:/opt/nutch/20130412115759/parse_data
Input path does not exist: file:/opt/nutch/20130412115759/parse_text
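Any ideas why this script isn't working? I think it must be an error in the script itself rather than in my config, because the path it is looking for doesn't exist, and I'm not sure why it would even be looking there.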
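To narrow down whether this is purely a path problem, a small check along these lines may help. It is only a sketch: `check_segment` is a hypothetical helper, not part of Nutch, and the four subdirectory names are simply the ones listed in the error above.

```shell
#!/usr/bin/env bash
# Sketch only: check_segment is a hypothetical helper, not a Nutch command.
# It reports which of the subdirectories named in the error message are
# missing under the given segment path.
check_segment() {
  local seg="$1"
  local missing=0
  local sub
  for sub in crawl_fetch crawl_parse parse_data parse_text; do
    if [ ! -d "$seg/$sub" ]; then
      echo "missing: $seg/$sub" >&2
      missing=1
    fi
  done
  # Exit status 0 when all four subdirectories exist, 1 otherwise.
  return "$missing"
}
```

Run from /opt/nutch, `check_segment 20130412115759` would be expected to fail in the same way as the indexer, while `check_segment crawl_test/segments/20130412115759` should pass if the segment was written under the crawl directory (assuming crawl_test is the directory passed to the script, as in the manual steps).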