
Executor heartbeat timed out - Spark on Dataproc

I am trying to fit an ML model in Spark (2.0.0) on a Google Dataproc cluster. While fitting the model I get an "Executor heartbeat timed out" error. How can I resolve this?

Other answers suggest this may be caused by (one of the executors) running out of memory. I read the possible solutions as: set the right configuration, repartition, cache, or get a bigger cluster. What can I do, preferably without setting up a larger cluster? (More/fewer partitions? Less caching? Adjust settings?)
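Of the options that do not involve resizing the cluster, repartitioning and caching are the cheapest to try. A minimal PySpark sketch of that route (the input path, column names, and partition count are placeholders, not taken from the original job):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("heartbeat-timeout-test").getOrCreate()

# Hypothetical input; replace with the real training data source.
df = spark.read.parquet("gs://my-bucket/training-data")

# More, smaller partitions lower the per-task memory footprint; 200 is only
# a starting guess and should be tuned to the data size.
df = df.repartition(200).cache()
df.count()  # materialize the cache before fitting

# The stack trace below fails inside StringIndexer.fit, so that is the step
# this sketch targets.
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
model = indexer.fit(df)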

My setup: Spark 2.0.0 on a Google Dataproc cluster with 1 master and 2 workers, all with the same specs: n1-highmem-8 -> 8 vCPUs, 52.0 GB memory, 500 GB disk

Settings:

spark\:spark.executor.cores=1 
distcp\:mapreduce.map.java.opts=-Xmx2457m 
spark\:spark.driver.maxResultSize=1920m 
mapred\:mapreduce.map.java.opts=-Xmx2457m 
yarn\:yarn.nodemanager.resource.memory-mb=6144 
mapred\:mapreduce.reduce.memory.mb=6144 
spark\:spark.yarn.executor.memoryOverhead=384 
mapred\:mapreduce.map.cpu.vcores=1 
distcp\:mapreduce.reduce.memory.mb=6144 
mapred\:yarn.app.mapreduce.am.resource.mb=6144 
mapred\:mapreduce.reduce.java.opts=-Xmx4915m 
yarn\:yarn.scheduler.maximum-allocation-mb=6144 
dataproc\:dataproc.scheduler.max-concurrent-jobs=11 
dataproc\:dataproc.heartbeat.master.frequency.sec=30 
mapred\:mapreduce.reduce.cpu.vcores=2 
distcp\:mapreduce.reduce.java.opts=-Xmx4915m 
distcp\:mapreduce.map.memory.mb=3072 
spark\:spark.driver.memory=3840m 
mapred\:mapreduce.map.memory.mb=3072 
yarn\:yarn.scheduler.minimum-allocation-mb=512 
mapred\:yarn.app.mapreduce.am.resource.cpu-vcores=2 
spark\:spark.yarn.am.memoryOverhead=384 
spark\:spark.executor.memory=2688m 
spark\:spark.yarn.am.memory=2688m 
mapred\:yarn.app.mapreduce.am.command-opts=-Xmx4915m 

Full error:

Py4JJavaError: An error occurred while calling o4973.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 151 in stage 16964.0 failed 4 times, most recent failure: Lost task 151.3 in stage 16964.0 (TID 779444, reco-test-w-0.c.datasetredouteasvendor.internal): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175122 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:371)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1155)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:91)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)


Did you explicitly set 'spark:spark.executor.cores' and 'spark:spark.executor.memory' yourself? By default Dataproc gives more than one core per executor, and it also looks like your 'spark:spark.executor.memory' differs from the default that Dataproc computes for n1-highmem-8. –


In short, the easiest way to get more memory headroom per task is simply to adjust 'spark.executor.memory'; you can do this at job submission time without rebuilding the cluster. If you use 'spark-shell' or command-line 'pyspark' instead of Dataproc job submission, you can run 'pyspark --conf spark.executor.memory=5376m', for example. There is no hard cap you need to worry about when cranking the number up; it can go up until it hits roughly the overall size of a single machine. With more memory per executor you will have fewer executors, though, so a few cores may sit idle with a larger memory setting. –
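As an aside, when the job runs as a standalone PySpark script rather than an interactive shell, the same override can be applied while building the SparkSession. A sketch using the 5376m figure from the comment above (the app name is hypothetical, and the setting only takes effect if no SparkContext exists yet):

from pyspark.sql import SparkSession

# spark.executor.memory must be set before the SparkContext is created;
# in an already-running pyspark shell this has no effect, which is why the
# --conf flag at launch time is the suggested route.
spark = (
    SparkSession.builder
    .appName("fit-with-bigger-executors")
    .config("spark.executor.memory", "5376m")
    .getOrCreate()
)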


You are right. The defaults for an n1-highmem-8 cluster are spark.executor.memory=18619m and spark.executor.cores=4. Since the workers have 8 cores and 52 GB of memory, can I set spark.executor.memory=50000m and spark.executor.cores=8? Or is that too high? – Stijn

Answer


Since this question has no answer, to summarize: the issue appears to have been related to spark.executor.memory being set too low, causing occasional out-of-memory errors on an executor.

The suggested fix was to first start with the default Dataproc configuration, which tries to make full use of the cores and memory available on each instance. If the problem persists, then adjust spark.executor.memory and spark.executor.cores to increase the amount of memory available per task (essentially spark.executor.memory / spark.executor.cores).
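A rough per-task view of the numbers quoted in this thread (only values mentioned above are used; memoryOverhead and off-heap usage are ignored, and the override row assumes spark.executor.cores stays at 1):

# Memory available per task is roughly spark.executor.memory / spark.executor.cores.
configs = {
    "settings in the question": ("2688m", 1),
    "override suggested in the comments": ("5376m", 1),
    "Dataproc default for n1-highmem-8": ("18619m", 4),
}

for name, (executor_memory, executor_cores) in configs.items():
    per_task_mb = int(executor_memory.rstrip("m")) / executor_cores
    print(f"{name}: ~{per_task_mb:.0f} MB per task")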

Dennis gives more details about Spark memory configuration on Dataproc in his answer to:
Google Cloud Dataproc configuration issues
