My script is written in Python, and it ran fine on DSE 4.8 without a Docker environment. I have now upgraded to DSE 5.0.4 and run it in a Docker environment, and I get the RPC error below. I was previously on DSE's Spark 1.4.1; now I am on 1.6.2. Why does this Spark 1.6.2 RPC error occur?
The host OS is CentOS 7.2 and the Docker OS is the same. We use spark-submit to launch the job, and we tried giving the executors 2G, 4G, 6G, and 8G; all of them produce the same error message.
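For reference, the same executor sizing can also be pinned from the Python side when the script builds its own context; a minimal sketch, assuming the script creates its own SparkContext (the property names are standard Spark 1.6 settings; the app name and core count are hypothetical):

from pyspark import SparkConf, SparkContext

# Standard Spark 1.6 properties; the memory value mirrors the sizes tried above.
conf = (SparkConf()
        .setAppName("user_profile_step1")      # hypothetical app name
        .set("spark.executor.memory", "4g")    # tried 2g/4g/6g/8g
        .set("spark.executor.cores", "2"))     # hypothetical core count
sc = SparkContext(conf=conf)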
The same Python script ran without problems in my previous environment, but after the upgrade it no longer works.
Scala jobs run fine in the current environment; only the Python part has problems. Rebooting the hosts did not solve the problem, and recreating the Docker containers did not help either.
Edit:
Maybe my map/reduce functions are simply too complex. The problem might be there, but I am not sure.
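If the suspicion is the upstream map/reduce stages, one way to narrow it down is to materialize the intermediate RDD before the write, so a failure points at the transformations rather than the save; a minimal sketch, reusing the article_up_save_rdd name from the traceback below (everything else is hypothetical):

from pyspark import StorageLevel

# Persist and force the upstream map/reduce stages to run on their own,
# before the Cassandra save is attempted.
article_up_save_rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(article_up_save_rdd.count())  # materializes the RDD before the save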
Environment specs: the cluster consists of 6 hosts, each with a 16-core CPU, 32 GB of RAM, and a 500 GB SSD.
How can I solve this problem, and what does this error message actually mean? Thanks very much! Let me know if you need more information.
Error log:
Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN 2017-02-26 10:14:08,314 org.apache.spark.scheduler.TaskSetManager: Lost task 47.1 in stage 88.0 (TID 9705, 139.196.190.79): TaskKilled (killed intentionally)
Traceback (most recent call last):
File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 1116, in <module>
compute_each_dimension_and_format_user(article_by_top_all_tmp)
File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 752, in compute_each_dimension_and_format_user
sqlContext.createDataFrame(article_up_save_rdd, df_schema).write.format('org.apache.spark.sql.cassandra').options(keyspace='archive', table='articles_up_update').save(mode='append')
File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 395, in save
WARN 2017-02-26 10:14:08,336 org.apache.spark.scheduler.TaskSetManager: Lost task 63.1 in stage 88.0 (TID 9704, 139.196.190.79): TaskKilled (killed intentionally)
File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o795.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 619 in stage 88.0 failed 4 times, most recent failure: Lost task 619.3 in stage 88.0 (TID 9746, 139.196.107.73): ExecutorLostFailure (executor 59 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$han
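For context, the call that fails at line 752 of the traceback is the DataFrame write to Cassandra. A minimal self-contained sketch of that path, with a hypothetical schema and rows standing in for the real df_schema and article_up_save_rdd:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="cassandra_write_sketch")  # hypothetical app name
sqlContext = SQLContext(sc)

# Hypothetical stand-ins for the real schema and RDD.
df_schema = StructType([StructField("article_id", StringType(), False),
                        StructField("dimension", StringType(), True)])
article_up_save_rdd = sc.parallelize([("a1", "tech"), ("a2", "sports")])

# The same writer path as in the traceback: the spark-cassandra-connector
# data source, appending into archive.articles_up_update.
(sqlContext.createDataFrame(article_up_save_rdd, df_schema)
    .write.format('org.apache.spark.sql.cassandra')
    .options(keyspace='archive', table='articles_up_update')
    .save(mode='append'))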
Docker command:
docker run -d --net=host -i --privileged \
-e SEEDS=10.XX.XXx.XX1,10.XX.XXx.XXX \
-e CLUSTER_NAME="MyCluster" \
-e LISTEN_ADDRESS=10.XX.XXx.XX \
-e BROADCAST_RPC_ADDRESS=139.XXX.XXX.XXX \
-e RPC_ADDRESS=0.0.0.0 \
-e STOMP_INTERFACE=10.XX.XXx.XX \
-e HOSTS=139.XX.XXx.XX \
-v /data/dse/lib/cassandra:/var/lib/cassandra \
-v /data/dse/lib/spark:/var/lib/spark \
-v /data/dse/log/cassandra:/var/log/cassandra \
-v /data/dse/log/spark:/var/log/spark \
-v /data/agent/log:/opt/datastax-agent/log \
--name dse_container registry..xxx.com/rechao/dse:5.0.4 -s
You updated more than just DataStax. You are now using Docker, and the error explicitly mentions "exceeding thresholds, or network issues", so what is your host OS, and how much memory are you giving the executors? –
@cricket_007 The host OS is CentOS 7.2 and the Docker OS is the same. We use spark-submit to launch the job, and we tried giving the executors 2G, 4G, 6G, and 8G; they all gave the same error message. Any idea why? Thanks – peter
OK, then it could be a network issue. Is the container exposing the appropriate ports? –
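Following up on the ports question: since the container runs with --net=host it shares the host's ports, but reachability from the worker hosts back to the driver can still be checked directly. A minimal probe sketch, with hypothetical host and port values (7077 is the usual standalone master port; spark.driver.port is random unless pinned):

import socket

# Hypothetical target; substitute whichever host/port pair is under test.
host, port = "139.196.190.79", 7077

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect((host, port))
    print("reachable")
except socket.error as exc:
    print("unreachable:", exc)
finally:
    s.close()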