
Data distribution while repartitioning an RDD in Spark

Consider the following code snippet (Python 2.7, running Spark 2.1):

from pyspark import SparkContext

nums = range(0, 10)

with SparkContext("local[2]") as sc:
    # with local[2] the default parallelism is 2, so we expect 2 partitions
    rdd = sc.parallelize(nums)
    print("Number of partitions: {}".format(rdd.getNumPartitions()))
    print("Partitions structure: {}".format(rdd.glom().collect()))

    # reshuffle the data into 5 partitions
    rdd2 = rdd.repartition(5)
    print("Number of partitions: {}".format(rdd2.getNumPartitions()))
    print("Partitions structure: {}".format(rdd2.glom().collect()))

The output is:

Number of partitions: 2 
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]] 

Number of partitions: 5 
Partitions structure: [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [], [], []] 

Why is the data not distributed across all partitions after repartitioning?

Answer


In PySpark, repartition means coalesce(numPartitions, shuffle=True) (see the core code here). That is, the data is shuffled across the network and assigned to partitions in a round-robin fashion: the first record goes to the first processing node, the second to the second processing node, and so on. But in your case, although you asked for local[2], i.e. two (hypothetical) nodes, my guess is that Spark only got one core from the local machine, so it placed all the values on the particular node where the task ran.
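
A minimal sketch to check that claim, reusing the rdd built in the question (the point is only that both calls go through the same code path, so they should produce the same layout):

    # if repartition(n) really is coalesce(n, shuffle=True),
    # both should yield the same partition layout
    rdd2 = rdd.repartition(5)
    rdd3 = rdd.coalesce(5, shuffle=True)
    print("repartition: {}".format(rdd2.glom().collect()))
    print("coalesce:    {}".format(rdd3.glom().collect()))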


Thanks for your comment. I don't think that's the case. This approach works when using DataFrames (see https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4), but it fails on plain RDDs – Khozzy
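
For reference, a minimal sketch of one workaround for plain RDDs: key each record and hash-partition on the key instead of relying on repartition. The expected output shown in the comment is an assumption (PySpark's default portable_hash of an int is the int itself, so keys fall into partition key % 5):

    from pyspark import SparkContext

    with SparkContext("local[2]") as sc:
        rdd = sc.parallelize(range(0, 10))
        evened = (rdd.zipWithIndex()                  # pair each value with its index
                     .map(lambda vi: (vi[1], vi[0]))  # (index, value): the index becomes the key
                     .partitionBy(5)                  # hash-partition on the key
                     .values())                       # drop the key again
        print("Partitions structure: {}".format(evened.glom().collect()))
        # expected: an even spread, e.g. [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]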