
Data distribution while repartitioning an RDD in Spark

Consider the following code snippet (Python 2.7, running Spark 2.1):

from pyspark import SparkContext

nums = range(0, 10)

with SparkContext("local[2]") as sc:
    # with local[2] the default parallelism is 2, so we expect 2 partitions
    rdd = sc.parallelize(nums)
    print("Number of partitions: {}".format(rdd.getNumPartitions()))
    print("Partitions structure: {}".format(rdd.glom().collect()))

    # reshuffle the data into 5 partitions
    rdd2 = rdd.repartition(5)
    print("Number of partitions: {}".format(rdd2.getNumPartitions()))
    print("Partitions structure: {}".format(rdd2.glom().collect()))

The output is:

Number of partitions: 2 
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]] 

Number of partitions: 5 
Partitions structure: [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [], [], []] 

Why is the data not distributed across all partitions after repartitioning?

Answer


In PySpark, repartition means coalesce(numPartitions, shuffle=True) (see the core code here). That is, the data is shuffled across the network and assigned to partitions in a round-robin fashion: the first record goes to the first processing node, the second to the second processing node, and so on. But in your case, although you asked for local[2], i.e. two (hypothetical) nodes, my guess is that Spark only got one core from the local machine, so it placed all the values on the particular node where the task ran.
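
A minimal sketch to check that claim, reusing the rdd built in the question (the point is only that both calls go through the same code path, so they should produce the same layout):

    # if repartition(n) really is coalesce(n, shuffle=True),
    # both should yield the same partition layout
    rdd2 = rdd.repartition(5)
    rdd3 = rdd.coalesce(5, shuffle=True)
    print("repartition: {}".format(rdd2.glom().collect()))
    print("coalesce:    {}".format(rdd3.glom().collect()))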


Thanks for your comment. I don't think that's the case. This approach works when using DataFrames (see https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4), but it fails on plain RDDs – Khozzy
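
For reference, a minimal sketch of one workaround for plain RDDs: key each record and hash-partition on the key instead of relying on repartition. The expected output shown in the comment is an assumption (PySpark's default portable_hash of an int is the int itself, so keys fall into partition key % 5):

    from pyspark import SparkContext

    with SparkContext("local[2]") as sc:
        rdd = sc.parallelize(range(0, 10))
        evened = (rdd.zipWithIndex()                  # pair each value with its index
                     .map(lambda vi: (vi[1], vi[0]))  # (index, value): the index becomes the key
                     .partitionBy(5)                  # hash-partition on the key
                     .values())                       # drop the key again
        print("Partitions structure: {}".format(evened.glom().collect()))
        # expected: an even spread, e.g. [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]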