I'm trying to get some Spark jobs to run, but the executors keep running out of memory. How can I spread the work out so that it doesn't run out of memory?
17/02/06 19:12:02 WARN TaskSetManager: Lost task 10.0 in stage 476.3 (TID 133250, 10.0.0.10): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1486378087852_0006_01_000019 on host: 10.0.0.10. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486378087852_0006_01_000019
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
Since I've already set spark.executor.memory=20480m, I don't think the job should really need more RAM to work, so the other option I see is to increase the number of partitions.
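For context, this is roughly how the executor memory is set; a minimal sketch assuming a plain SparkConf-based setup (the actual HDInsight submission details may differ, and only the spark.executor.memory value is taken from above):

# Sketch only: applying spark.executor.memory at context creation (setup assumed)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set('spark.executor.memory', '20480m')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)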
I've tried:
>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
>>> sqlContext.getConf("spark.sql.shuffle.partitions")
u'2001'
and
>>> all_users.repartition(2001)
but when I start the job I can still see the default 200 partitions:
>>> all_users.repartition(2001).show()
[Stage 526:(0 + 30)/200][Stage 527:>(0 + 0)/126][Stage 528:>(0 + 0)/128]0]
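For reference, a direct way to check the partition count of the repartitioned DataFrame (a minimal sketch using the standard rdd.getNumPartitions() call, rather than reading the numbers off the progress bar):

>>> # Partition count of the underlying RDD after the repartition
>>> all_users.repartition(2001).rdd.getNumPartitions()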
I'm using PySpark 2.0.2 on Azure HDInsight. Can anyone point out what I'm doing wrong?
EDIT
Based on the answer below, I tried:
sqlContext.setConf('spark.sql.shuffle.partitions', 2001)
at the beginning, but it didn't work. However, this did work:
sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)
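In case it matters, this is roughly the ordering I use; a sketch, and the assumption that the setting has to be in place before the table is read is mine:

# Sketch: cap the input-split size (~100 MB) before building the DataFrame
sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)
all_users = sqlContext.table('RoamPositions')  # full definition below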
all_users is a SQL DataFrame. A concrete example is:
from pyspark.sql import functions as F  # needed for F.lag()

# user_window is a window spec defined earlier; see the sketch below
all_users = sqlContext.table('RoamPositions')\
    .withColumn('prev_district_id', F.lag('district_id', 1).over(user_window))\
    .withColumn('prev_district_name', F.lag('district_name', 1).over(user_window))\
    .filter('prev_district_id IS NOT NULL AND prev_district_id != district_id')\
    .select('timetag', 'imsi', 'prev_district_id', 'prev_district_name', 'district_id', 'district_name')
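user_window is a Window specification defined earlier in my code; a sketch of what it looks like (partitioning by 'imsi' and ordering by 'timetag' is an assumption based on the columns selected above):

from pyspark.sql.window import Window

# Hypothetical window: one partition per subscriber (imsi), ordered by time (timetag)
user_window = Window.partitionBy('imsi').orderBy('timetag')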