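Spark application kills executor
I'm running a Spark cluster in standalone mode and submitting applications with spark-submit. In the Stages section of the Spark UI I found a running stage with a very long execution time (> 10 h, when the usual time is about 30 s). The stage has many failed tasks with the error Resubmitted (resubmitted due to lost executor). On the stage page, the Aggregated Metrics by Executor section lists an executor whose address is CANNOT FIND ADDRESS. Spark tries to resubmit this task indefinitely. If I kill the stage (my application automatically reruns unfinished Spark jobs), everything continues to work well.
I also found some strange entries in the Spark logs, from the same time the stage started executing.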
Master:
16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1
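To make the sequence in the Master log easier to follow, the entries can be grouped per executor. The following is only a minimal sketch: the function name `executor_events`, the event labels, and the regular expressions are illustrative and were written against the six lines shown above, and the comma-separated format of the kill list is an assumption (the excerpt only ever shows a single ID).

```python
import re
from collections import defaultdict

# Sample lines copied verbatim from the Master log above.
MASTER_LOG = """\
16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1
"""

# Patterns matched against the message text of the sample lines only.
KILL_RE = re.compile(r"Application (\S+) requests to kill executors: (.+)$")
LAUNCH_RE = re.compile(r"Launching executor (\S+)/(\d+) on worker (\S+)")
UNKNOWN_RE = re.compile(r"Got status update for unknown executor (\S+)/(\d+)")


def executor_events(log_text):
    """Group Master log events by (application id, executor id)."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        timestamp = line[:17]  # "yy/MM/dd HH:mm:ss" is 17 characters long
        m = KILL_RE.search(line)
        if m:
            app_id = m.group(1)
            # Assumption: several executor IDs would be comma-separated.
            for exec_id in m.group(2).split(","):
                events[(app_id, exec_id.strip())].append((timestamp, "kill requested"))
            continue
        m = LAUNCH_RE.search(line)
        if m:
            events[(m.group(1), m.group(2))].append((timestamp, "launched"))
            continue
        m = UNKNOWN_RE.search(line)
        if m:
            events[(m.group(1), m.group(2))].append((timestamp, "unknown status update"))
    return dict(events)


if __name__ == "__main__":
    for (app_id, exec_id), history in sorted(executor_events(MASTER_LOG).items()):
        print(app_id, "executor", exec_id)
        for timestamp, event in history:
            print("   ", timestamp, event)
```

Grouped this way, the sample shows the same pattern twice: the application asks the Master to kill an executor, a replacement is launched a few seconds later, and the status update for the killed executor arrives after the Master no longer knows about it.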
Worker:
16/11/25 10:06:05 INFO Worker: Asked to kill executor app-20161109161724-0045/1
16/11/25 10:06:08 INFO ExecutorRunner: Runner thread for executor app-20161109161724-0045/1 interrupted
16/11/25 10:06:08 INFO ExecutorRunner: Killing process!
16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
16/11/25 10:06:14 INFO Worker: Asked to launch executor app-20161109161724-0045/2 for app.jar
16/11/25 10:06:17 INFO SecurityManager: Changing view acls to: spark
16/11/25 10:06:17 INFO SecurityManager: Changing modify acls to: spark
16/11/25 10:06:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
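The `exitStatus 137` in the Worker log can be decoded with the common Unix convention that a status above 128 means the process was terminated by signal `status - 128`. The sketch below assumes that convention applies to the value the Worker reports and that it runs on a Unix-like system; `describe_exit_status` is a hypothetical helper, not part of Spark.

```python
import signal


def describe_exit_status(status):
    """Describe an exit status, assuming the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

Under that assumption, 137 corresponds to signal 9 (SIGKILL), which is consistent with the `Killing process!` line just before it: the executor was killed on request rather than crashing by itself.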
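There is no problem with network connections, because the worker, the master (logs above), and the driver all run on the same machine.
Spark version: 1.6.1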
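Can you add the logs of the worker that caused the failure? The number of task failures might be causing the worker to die. Is any exception thrown? –
@YuvalItzchakov The question already contains the worker logs from the worker with the lost executor. There were no exceptions or failures before the executor was lost. – Cortwave
*"worker logs from the worker with the lost executor"* Not sure what that means –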