Data distribution while repartitioning an RDD in Spark

Consider the following code snippet (Python 2.7, running Spark 2.1):
from pyspark import SparkContext

nums = range(0, 10)

with SparkContext("local[2]") as sc:
    rdd = sc.parallelize(nums)
    print("Number of partitions: {}".format(rdd.getNumPartitions()))
    print("Partitions structure: {}".format(rdd.glom().collect()))

    rdd2 = rdd.repartition(5)
    print("Number of partitions: {}".format(rdd2.getNumPartitions()))
    print("Partitions structure: {}".format(rdd2.glom().collect()))
The output is:
Number of partitions: 2
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
Number of partitions: 5
Partitions structure: [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [], [], []]
Why is the data not distributed across all partitions after repartitioning?
Thanks for your comment, but I don't think that's the case. This approach works when using DataFrames (see https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4), but it fails on plain RDDs. – Khozzy
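As an aside, one common way to force a more even spread on a plain RDD (this is a workaround sketch, not the asker's code) is to attach keys to the elements and call `partitionBy(numPartitions)`, which places each pair by hashing its key. Under the assumption that the partitioner computes `hash(key) % numPartitions` (as Spark's `HashPartitioner` does), the resulting placement for keys 0–9 across 5 partitions can be modeled in plain Python:

```python
# Model of hash partitioning: partition = hash(key) % num_partitions.
# Assumption: this mirrors what partitionBy does with its default hash-based
# partitioner; key names and structure here are illustrative only.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

assignments = {}
for key in range(10):
    assignments.setdefault(partition_for(key, 5), []).append(key)

# With CPython's integer hashing (hash(i) == i for small non-negative ints),
# each of the 5 partitions receives exactly two keys.
print(assignments)
```

So an RDD of `(key, value)` pairs keyed this way would land evenly, unlike the `repartition(5)` call above, which shuffled everything into a single partition.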