2016-09-28

I have this code which had been working fine for months, and about two months ago it stopped working on Google Dataproc even though I haven't changed a single line. The Spark job seems to have become incompatible with Google Dataproc.

I can reproduce the bug with just a few lines, so I won't post the huge block of code:

SparkConf sparkConf = new SparkConf().setAppName("test"); 
JavaSparkContext jsc = new JavaSparkContext(sparkConf); 

JavaRDD<String> rdd = jsc.parallelize(Arrays.asList("a", "b", "c")); 
JavaPairRDD<String, String> pairs = rdd.flatMapToPair(value -> 
     Arrays.asList(
       new Tuple2<>(value, value + "1"), 
       new Tuple2<>(value, value + "2") 
     ) 
); 
pairs.collect().forEach(System.out::println); 

Then I get this obscure exception:

WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, test-cluster-w-1.c.test-project.internal): java.lang.AbstractMethodError: uk.co.test.CalculateScore$$Lambda$10/1666820030.call(Ljava/lang/Object;)Ljava/util/Iterator; 
     at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:142) 
     at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:142) 
     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
     at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) 
     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) 
     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) 
     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) 
     at scala.collection.AbstractIterator.to(Iterator.scala:1336) 
     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) 
     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) 
     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) 
     at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) 
     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) 
     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) 
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) 
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
     at org.apache.spark.scheduler.Task.run(Task.scala:85) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, test-cluster-w-0.c.test-project.internal): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1475077182957_0001_01_000005 on host: sun-recommendations-evaluation-w-0.c.test-project.internal. Exit status: 50. Diagnostics: Exception from container-launch. 
Container id: container_1475077182957_0001_01_000005 
Exit code: 50 
Stack trace: ExitCodeException exitCode=50: 
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) 
     at org.apache.hadoop.util.Shell.run(Shell.java:456) 
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) 
     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) 
     at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 


Container exited with a non-zero exit code 50 

Driver stacktrace: 
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) 
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at scala.Option.foreach(Option.scala:257) 
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) 
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911) 
     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893) 
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
     at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) 
     at org.apache.spark.rdd.RDD.collect(RDD.scala:892) 
     at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:360) 
     at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45) 
     at uk.co.test.CalculateScore.main(CalculateScore.java:50) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:498) 
     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) 
     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) 
     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) 
     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) 
     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 

If I run it locally with:

sparkConf.setMaster("local[2]") 

then it works fine and prints:

(a,a1) 
(a,a2) 
(b,b1) 
(b,b2) 
(c,c1) 
(c,c2) 

These are my Spark dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.0</version>
</dependency>

Any help is appreciated.

Answer

The default image used by Dataproc was recently upgraded to Spark 2.0/Scala 2.11. The change happened in August, which likely explains the difference you are seeing.

This page details which software versions are included in each Dataproc image release.

It may be enough to update your pom.xml with the following, recompile, and rerun your application:

<dependency> 
    <groupId>org.apache.spark</groupId> 
    <artifactId>spark-core_2.11</artifactId> 
    <version>2.0.0</version> 
</dependency> 
<dependency> 
    <groupId>org.apache.spark</groupId> 
    <artifactId>spark-mllib_2.11</artifactId> 
    <version>2.0.0</version> 
</dependency> 
<dependency> 
    <groupId>org.apache.spark</groupId> 
    <artifactId>spark-streaming_2.11</artifactId> 
    <version>2.0.0</version> 
</dependency> 

The release notes for Spark 2.0 cover the changes and removals between Spark 1.6 and 2.0.
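
For reference, the AbstractMethodError above matches one of those changes: in the Spark 2.x Java API, flatMap-style functions (including the PairFlatMapFunction used by flatMapToPair) must return an Iterator instead of an Iterable. A minimal sketch of the reproduction code ported to the 2.11/2.0.0 dependencies above might look like this (the class name is just a placeholder; the only functional change is the trailing .iterator()):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class FlatMapToPairSpark2Test {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("test");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        JavaRDD<String> rdd = jsc.parallelize(Arrays.asList("a", "b", "c"));

        // In Spark 2.x the lambda must return an Iterator rather than an
        // Iterable, hence the .iterator() call at the end.
        JavaPairRDD<String, String> pairs = rdd.flatMapToPair(value ->
                Arrays.asList(
                        new Tuple2<>(value, value + "1"),
                        new Tuple2<>(value, value + "2")
                ).iterator()
        );

        pairs.collect().forEach(System.out::println);
        jsc.stop();
    }
}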

As an alternative, you can keep using the 1.0 image track with the following gcloud invocation:

$ gcloud dataproc clusters create --image-version 1.0 ... 

When pinning an explicit image track, keep in mind that major/minor versions can be deprecated and eventually removed. The Dataproc image versioning policy describes the support timeline for image versions.

Thanks Angus, it works perfectly. Changing to version 2.11 will require code changes, so in the meantime I have to use '--image-version 1.0'. For anyone else with the same problem, I'd suggest setting --image-version to the current version to avoid backward-compatibility issues with future Dataproc upgrades. – cahen
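
For instance, pinning a newly created cluster to the image track it currently runs on might look like the following (the cluster name is a placeholder, and the assumption that the Spark 2.0 images are on the 1.1 track should be checked against the image version list linked above):

$ gcloud dataproc clusters create my-cluster --image-version 1.1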

Good point. I've also added a link to the image version support policy, just to call out that image versions are not supported indefinitely. –