I am trying to fit an ml model in Spark (2.0.0) on a Google Dataproc cluster. When fitting the model I get an executor heartbeat timed out error. How can I resolve this?
Other answers suggest this can be caused by (one of) the executors running out of memory. The solutions I have found boil down to: use the right settings, repartition, cache, or get a bigger cluster. What can I do, preferably without setting up a bigger cluster? (Make more/fewer partitions? Cache less? Adjust settings?)
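For reference, a minimal sketch of the repartition-and-cache idea mentioned above, applied in PySpark before calling .fit() on the estimator. The input path, DataFrame name, and partition count are placeholders, not taken from the original job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heartbeat-debug").getOrCreate()

# Hypothetical input; replace with the real training data source.
train_df = spark.read.parquet("gs://some-bucket/train/")

# More, smaller partitions mean each task holds less data in memory at once;
# caching avoids recomputing the input on every pass the estimator makes.
train_df = train_df.repartition(200).cache()
train_df.count()  # materialize the cache before calling .fit()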
My setup: 1 master and 2 workers, all with identical specs, on a Google Dataproc cluster:
n1-highmem-8 -> 8 vCPUs, 52.0 GB memory, 500 GB disk
Spark 2.0.0
Settings (see the sketch after this list):
spark\:spark.executor.cores=1
distcp\:mapreduce.map.java.opts=-Xmx2457m
spark\:spark.driver.maxResultSize=1920m
mapred\:mapreduce.map.java.opts=-Xmx2457m
yarn\:yarn.nodemanager.resource.memory-mb=6144
mapred\:mapreduce.reduce.memory.mb=6144
spark\:spark.yarn.executor.memoryOverhead=384
mapred\:mapreduce.map.cpu.vcores=1
distcp\:mapreduce.reduce.memory.mb=6144
mapred\:yarn.app.mapreduce.am.resource.mb=6144
mapred\:mapreduce.reduce.java.opts=-Xmx4915m
yarn\:yarn.scheduler.maximum-allocation-mb=6144
dataproc\:dataproc.scheduler.max-concurrent-jobs=11
dataproc\:dataproc.heartbeat.master.frequency.sec=30
mapred\:mapreduce.reduce.cpu.vcores=2
distcp\:mapreduce.reduce.java.opts=-Xmx4915m
distcp\:mapreduce.map.memory.mb=3072
spark\:spark.driver.memory=3840m
mapred\:mapreduce.map.memory.mb=3072
yarn\:yarn.scheduler.minimum-allocation-mb=512
mapred\:yarn.app.mapreduce.am.resource.cpu-vcores=2
spark\:spark.yarn.am.memoryOverhead=384
spark\:spark.executor.memory=2688m
spark\:spark.yarn.am.memory=2688m
mapred\:yarn.app.mapreduce.am.command-opts=-Xmx4915m
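These entries are in the prefixed scope:key=value form that Dataproc uses for cluster properties. To confirm which Spark values actually reached the running job (as opposed to what was requested at cluster creation), the effective configuration can be read back from the live session; a minimal sketch, assuming a PySpark session named spark is already available:

# Print the effective executor sizing as Spark sees it at runtime.
conf = spark.sparkContext.getConf()
for key in ("spark.executor.memory",
            "spark.executor.cores",
            "spark.yarn.executor.memoryOverhead",
            "spark.driver.memory"):
    print(key, "=", conf.get(key, "<not set>"))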
Full error:
Py4JJavaError: An error occurred while calling o4973.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 151 in stage 16964.0 failed 4 times, most recent failure: Lost task 151.3 in stage 16964.0 (TID 779444, reco-test-w-0.c.datasetredouteasvendor.internal): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175122 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:371)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1155)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:91)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
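The trace shows the failure surfacing inside StringIndexer.fit, which runs a countByValue (and hence a collect) over its input column, so the stage that times out is driven by that indexing pass. A minimal sketch of the call path in question, with a placeholder DataFrame and column name rather than the ones from the original job:

from pyspark.ml.feature import StringIndexer

# StringIndexer.fit triggers the countByValue action visible in the trace above.
indexer = StringIndexer(inputCol="some_string_col", outputCol="some_string_col_idx")
indexer_model = indexer.fit(some_df)  # some_df is a placeholder training DataFrame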
Did you explicitly set 'spark:spark.executor.cores' and 'spark:spark.executor.memory' yourself? By default Dataproc packs more than one core per executor, and it also looks like 'spark:spark.executor.memory' differs from the value Dataproc would compute by default for n1-highmem-8. –
In short, the easiest way to give each unit of task work more memory is simply to adjust 'spark.executor.memory'; you can do this at job-submission time without rebuilding the cluster; if you use 'spark-shell' or the 'pyspark' command line instead of Dataproc job submission, you can run e.g. 'pyspark --conf spark.executor.memory=5376m'. You shouldn't have to worry much about cranking the number up; you can go until it hits roughly the overall size of a single machine; with more memory per executor you will have fewer executors, though, so a few cores may sit idle with the larger memory settings. –
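If the job is submitted as a script rather than started through an interactive 'pyspark --conf ...' session, the same override can be set programmatically when the session is built, since 'spark.executor.memory' only needs to be in place before the executors are requested. A minimal sketch of that variant (the 5376m figure is just the value quoted above, not a recommendation):

from pyspark.sql import SparkSession

# Request larger executors for this job only; the cluster defaults stay untouched.
spark = (SparkSession.builder
         .appName("larger-executors")
         .config("spark.executor.memory", "5376m")
         .getOrCreate())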
You are right. For an n1-highmem-8 cluster the defaults are spark.executor.memory=18619m and spark.executor.cores=4. But since the workers have 8 cores and 52GB of memory, can I set spark.executor.memory=50000m and spark.executor.cores=8? Or is that too high? – Stijn
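One sanity check worth doing against the YARN properties listed earlier: YARN sizes each executor container as spark.executor.memory plus spark.yarn.executor.memoryOverhead, and rejects requests above yarn.scheduler.maximum-allocation-mb. A rough arithmetic sketch using the values from this question, assuming those YARN limits are the ones actually in effect:

# Values taken from the settings listed above (all in MB).
executor_memory_mb = 2688          # spark:spark.executor.memory=2688m
memory_overhead_mb = 384           # spark:spark.yarn.executor.memoryOverhead=384
yarn_max_allocation_mb = 6144      # yarn:yarn.scheduler.maximum-allocation-mb=6144

container_request_mb = executor_memory_mb + memory_overhead_mb
print(container_request_mb, "<=", yarn_max_allocation_mb,
      "->", container_request_mb <= yarn_max_allocation_mb)  # 3072 <= 6144 -> True

# A 50000m executor would need 50000 MB plus overhead, far above the 6144 MB limit
# listed above, so the YARN maximum-allocation and nodemanager memory settings would
# have to be raised as well before such a request could be granted.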