
Nutch bin/crawl script fails - manual steps work fine

I'm trying to run the bin/crawl script that ships with Nutch 1.6, which performs all of the manual steps below to crawl a site and index it.

When I execute these steps manually, everything works as expected (it only covers a single page, but I'll deal with that separately).

I created a text file urls.txt under seeds/ containing the URL of the page I want to index, then ran the following commands (also collected into a single script after the list):
bin/nutch inject crawl_test/crawldb seeds/ 

bin/nutch generate crawl_test/crawldb crawl_test/segments 

export SEGMENT=crawl_test/segments/`ls -tr crawl_test/segments|tail -1` 

bin/nutch fetch $SEGMENT -noParsing 

bin/nutch parse $SEGMENT 

bin/nutch updatedb crawl_test/crawldb $SEGMENT -filter -normalize 

bin/nutch invertlinks crawl_test/linkdb -dir crawl_test/segments 

bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/* 
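
For reference, here is the same manual sequence collected into one small shell script. This is just a sketch of the steps above, reusing the crawl_test/, seeds/ and Solr URL from this post:

#!/bin/sh
# One manual Nutch 1.6 crawl cycle: inject, generate, fetch, parse,
# update the crawldb, invert links, then index into Solr.
set -e

bin/nutch inject crawl_test/crawldb seeds/
bin/nutch generate crawl_test/crawldb crawl_test/segments

# pick the most recently created segment
SEGMENT=crawl_test/segments/`ls -tr crawl_test/segments | tail -1`

bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl_test/crawldb $SEGMENT -filter -normalize
bin/nutch invertlinks crawl_test/linkdb -dir crawl_test/segments
bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/*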

The bin/crawl script gives this error...

Indexing 20130412115759 on SOLR index -> someurl:8080/solr/ 
SolrIndexer: starting at 2013-04-12 11:58:47 
SolrIndexer: deleting gone documents: false 
SolrIndexer: URL filtering: false 
SolrIndexer: URL normalizing: false 
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/20130412115759/crawl_fetch 
Input path does not exist: file:/opt/nutch/20130412115759/crawl_parse 
Input path does not exist: file:/opt/nutch/20130412115759/parse_data 
Input path does not exist: file:/opt/nutch/20130412115759/parse_text 

Any ideas why this script doesn't work? I assume it has to be a bug in the script itself rather than in my configuration, since the path it is looking for doesn't exist and I don't see why it would even look there.
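
The error suggests the indexing step is being handed only the bare segment name (the timestamp), which Hadoop then resolves against the working directory instead of the crawl directory's segments/ folder. A quick way to find the offending call is to grep the script for the indexer invocation:

# locate the solrindex call inside the crawl script
grep -n solrindex bin/crawl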

Answer


Looks like there is a bug in the bin/crawl script:

- $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT 
+ $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT 
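
With that change the indexer receives the full segment path instead of the bare timestamped name. A minimal sketch of the equivalent manual call, reusing the crawl_test layout and Solr URL from the question (it simply picks the newest segment):

# pass the full path under segments/, not just the timestamped name
SEGMENT=`ls -tr crawl_test/segments | tail -1`
bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/$SEGMENT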