I'm trying to get some Spark jobs to run, but the executors keep running out of memory. How can I spread the work out so that it doesn't run out of memory?
17/02/06 19:12:02 WARN TaskSetManager: Lost task 10.0 in stage 476.3 (TID 133250, 10.0.0.10): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1486378087852_0006_01_000019 on host: 10.0.0.10. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486378087852_0006_01_000019
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
Since I've already set spark.executor.memory=20480m, I don't think the job should really need more RAM to work, so the other option I see is to increase the number of partitions.
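For context, this is roughly how the executor memory is set; a minimal sketch assuming a plain SparkConf-based setup (the actual HDInsight submission details may differ, and only the spark.executor.memory value is taken from above):

# Sketch only: applying spark.executor.memory at context creation (setup assumed)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set('spark.executor.memory', '20480m')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)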
I've tried:
>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
>>> sqlContext.getConf("spark.sql.shuffle.partitions")
u'2001'
and
>>> all_users.repartition(2001)
but when I start the job I can still see the default 200 partitions:
>>> all_users.repartition(2001).show()
[Stage 526:(0 + 30)/200][Stage 527:>(0 + 0)/126][Stage 528:>(0 + 0)/128]0]
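For reference, a direct way to check the partition count of the repartitioned DataFrame (a minimal sketch using the standard rdd.getNumPartitions() call, rather than reading the numbers off the progress bar):

>>> # Partition count of the underlying RDD after the repartition
>>> all_users.repartition(2001).rdd.getNumPartitions()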
I'm using PySpark 2.0.2 on Azure HDInsight. Can anyone point out what I'm doing wrong?
EDIT
Based on the answer below, I tried:
sqlContext.setConf('spark.sql.shuffle.partitions', 2001)
at the beginning, but it didn't work. However, this did work:
sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)
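In case it matters, this is roughly the ordering I use; a sketch, and the assumption that the setting has to be in place before the table is read is mine:

# Sketch: cap the input-split size (~100 MB) before building the DataFrame
sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)
all_users = sqlContext.table('RoamPositions')  # full definition below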
all_users is a SQL DataFrame. A concrete example is:
from pyspark.sql import functions as F  # needed for F.lag()

# user_window is a window spec defined earlier; see the sketch below
all_users = sqlContext.table('RoamPositions')\
    .withColumn('prev_district_id', F.lag('district_id', 1).over(user_window))\
    .withColumn('prev_district_name', F.lag('district_name', 1).over(user_window))\
    .filter('prev_district_id IS NOT NULL AND prev_district_id != district_id')\
    .select('timetag', 'imsi', 'prev_district_id', 'prev_district_name', 'district_id', 'district_name')
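user_window is a Window specification defined earlier in my code; a sketch of what it looks like (partitioning by 'imsi' and ordering by 'timetag' is an assumption based on the columns selected above):

from pyspark.sql.window import Window

# Hypothetical window: one partition per subscriber (imsi), ordered by time (timetag)
user_window = Window.partitionBy('imsi').orderBy('timetag')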