2016-12-29

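RuntimeException when Nutch runs generate

I am new to Nutch. I have installed Nutch 2.3.1 and configured it to use MongoDB. The inject step succeeds, but the generate step throws an exception (see below). Note: this error occurs with a seed file containing 60K URLs. When I tried with 100 URLs, everything went fine.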

Do you know what causes this error? Thanks!

2016-12-30 00:01:48,446 INFO crawl.GeneratorJob - GeneratorJob: starting at 2016-12-30 00:01:48 
2016-12-30 00:01:48,447 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch. 
2016-12-30 00:01:48,447 INFO crawl.GeneratorJob - GeneratorJob: starting 
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: filtering: true 
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: normalizing: true 
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: topN: 100000 
2016-12-30 00:01:48,816 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
2016-12-30 00:01:48,857 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 
2016-12-30 00:01:48,867 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 
2016-12-30 00:01:48,867 INFO crawl.AbstractFetchSchedule - maxInterval=7776000 
2016-12-30 00:01:51,568 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-30 00:01:51,573 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-30 00:01:51,753 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-30 00:01:51,760 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-30 00:01:52,408 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 
2016-12-30 00:01:52,408 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 
2016-12-30 00:01:52,408 INFO crawl.AbstractFetchSchedule - maxInterval=7776000 
2016-12-30 00:01:52,591 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default 
2016-12-30 00:02:03,229 ERROR mapreduce.GoraRecordReader - Error reading Gora records: Read operation to server localhost:27017 failed on database nutch 
2016-12-30 00:02:04,607 WARN mapred.LocalJobRunner - job_local1740651658_0001 
java.lang.Exception: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) 
Caused by: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch 
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122) 
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533) 
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) 
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) 
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch 
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:298) 
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:269) 
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:235) 
    at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:145) 
    at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:135) 
    at com.mongodb.DBCursor._hasNext(DBCursor.java:626) 
    at com.mongodb.DBCursor.hasNext(DBCursor.java:657) 
    at org.apache.gora.mongodb.query.MongoDBResult.nextInner(MongoDBResult.java:71) 
    at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:111) 
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:118) 
    ... 12 more 
Caused by: java.io.EOFException 
    at org.bson.io.Bits.readFully(Bits.java:75) 
    at org.bson.io.Bits.readFully(Bits.java:50) 
    at org.bson.io.Bits.readFully(Bits.java:37) 
    at com.mongodb.Response.<init>(Response.java:42) 
    at com.mongodb.DBPort$1.execute(DBPort.java:164) 
    at com.mongodb.DBPort$1.execute(DBPort.java:158) 
    at com.mongodb.DBPort.doOperation(DBPort.java:187) 
    at com.mongodb.DBPort.call(DBPort.java:158) 
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:290) 
    ... 21 more 
2016-12-30 00:02:04,846 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=nutch-maven-1.0-SNAPSHOT.jar, jobid=job_local1740651658_0001 
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120) 
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227) 
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256) 
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330) 

Answer

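I found that the problem comes from the MongoDB version. Nutch uses mongo-java-driver-2.13.1.jar, and I had installed MongoDB 3.4.1. So I installed MongoDB 2.6.7 instead, and now it works fine. I will try updating the driver in Nutch and let you know whether it works with the newer version of MongoDB.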
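As a rough illustration of the mismatch described above, here is a minimal Python sketch that reads the driver version from the jar file name and compares its major version with the server's. The function names `parse_version` and `driver_predates_server` are hypothetical. The rule "a driver with an older major version than the server is suspect" is only an assumption that matches the two combinations reported here; it is not the official MongoDB compatibility matrix.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def driver_predates_server(driver_jar, server_version):
    """Heuristic (assumption, not an official rule): flag a driver whose
    major version is older than the server's major version."""
    return parse_version(driver_jar)[0] < parse_version(server_version)[0]


# The combination that failed in this thread: driver 2.13.1 with MongoDB 3.4.1.
print(driver_predates_server("mongo-java-driver-2.13.1.jar", "3.4.1"))
# The combination that worked: driver 2.13.1 with MongoDB 2.6.7.
print(driver_predates_server("mongo-java-driver-2.13.1.jar", "2.6.7"))
```

A check like this only flags a pairing worth a closer look. The driver's own compatibility documentation is the place to confirm which server versions it supports.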

Did you succeed with the update? – rzo