2014-07-16

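Nutch - Getting "Error: JAVA_HOME is not set." when trying to crawl

First off, I am a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I try to execute a crawl with the following command: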

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5 

I get

Error: JAVA_HOME is not set. 

If I run the command without "sudo", I get:

Injector: starting at 2014-07-16 02:12:24 
Injector: crawlDb: urls/crawldb 
Injector: urlDir: crawl 
Injector: Converting injected urls to crawl db entries. 
Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hadoop/apache-nutch-1.8/runtime/local/crawl 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) 
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081) 
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073) 
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:415) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936) 
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910) 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353) 
    at org.apache.nutch.crawl.Injector.inject(Injector.java:279) 
    at org.apache.nutch.crawl.Injector.run(Injector.java:316) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.nutch.crawl.Injector.main(Injector.java:306) 

I can't figure this out. I have seen another thread on this here: Similar Topic

and followed it to no avail. I have added

export JAVA_HOME=/usr/lib/jvm/java-7-oracle 

export PATH=$PATH:${JAVA_HOME}/bin 

to my ~/.bashrc. I am using Linux.
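
A likely reason the `sudo` run still reports `JAVA_HOME is not set` with these exports in place: by default, sudo resets the environment before running a command, so variables exported from the invoking user's ~/.bashrc are typically not passed through. The sketch below only illustrates that effect. It uses `env -i` (which starts a child with an empty environment) as a stand-in for sudo's reset, and the `report` string is a hypothetical helper, not anything from Nutch.

```shell
# Value taken from the ~/.bashrc export shown above.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Hypothetical helper: a command string that prints what a child process sees.
report='echo "JAVA_HOME=${JAVA_HOME:-<not set>}"'

# env -i runs the child with an empty environment. This is an assumption-level
# stand-in for sudo's default environment reset, not sudo itself.
reset_view=$(env -i /bin/sh -c "$report")

# Passing the variable explicitly makes it visible to the child again.
passed_view=$(env -i JAVA_HOME="$JAVA_HOME" /bin/sh -c "$report")

echo "after reset:  $reset_view"
echo "passed along: $passed_view"
```

If root really is needed, the analogous forms with sudo would be `sudo JAVA_HOME="$JAVA_HOME" <command>` or `sudo -E <command>`, both of which depend on the local sudoers policy allowing it.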

Any help would be much appreciated!

Answer

The problem was that I ran

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5 

I used

bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5 

and everything was fine. It was just a malformed command, i.e. the 'crawl' usage specified here is deprecated: Apache Nutch Tutorial
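
The stack trace in the question is consistent with the script reading its arguments by position. The log shows `urlDir: crawl` and `crawlDb: urls/crawldb`, which suggests the first argument was taken as the seed directory and the second as the crawl directory. A minimal sketch of that binding follows, assuming the order `<seedDir> <crawlDir> <solrURL> <numberOfRounds>` implied by the working command. `show_crawl_args` is a hypothetical helper for illustration, not part of Nutch.

```shell
# Hypothetical helper: shows how four positional parameters would be bound.
# The parameter order is an assumption inferred from the working command above.
show_crawl_args() {
    printf 'seedDir=%s crawlDir=%s solrURL=%s rounds=%s\n' "$1" "$2" "$3" "$4"
}

# Arguments from the failing command: "crawl" lands in the seed directory slot,
# which matches the "Input path does not exist: .../crawl" error in the log.
show_crawl_args crawl urls -dir crawl -depth 3 -topN 5

# Arguments from the working command.
show_crawl_args ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5
```

Read this way, the old-style flags (`-dir`, `-depth`, `-topN`) are not interpreted as options at all. They simply fill the remaining positional slots.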