RDD分區邏輯

我想了解RDD分區邏輯。 RDD在節點間進行分區，但希望瞭解這種分區邏輯的工作原理。RDD分區邏輯

我有4個內核分配給它的虛擬機。我創建了兩個RDD，一個來自HDFS，一個來自並行操作。得到了創建

第一次兩個分隔但在第二操作4分區得到創建。

我檢查了分配給文件的塊號 - 它是1塊，因爲文件非常小，但是當我在該文件上創建了RDD時，它顯示了兩個分區。爲什麼是這樣？我在某處看到，分區還取決於核心，在我的情況下，這仍然不能滿足該輸出。

有人可以幫助理解這一點嗎？

2016-05-22 Shashi

[HDFS上的文件數據分區如何工作？]（http://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on- hdfs） – sgvd

的textFile完整的簽名是：

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

以第二個參數，minPartitions，你可以設置你想要得到的分區的最小量。正如你所看到的，默認情況下它設置爲defaultMinPartitions，這反過來又被定義爲：

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

的defaultParalellism值配置了spark.default.parallelism設置，默認運行星火時候取決於你的內核數量在本地模式下。這是4你的情況，所以你得到min(4, 2)，這就是爲什麼你得到2個分區。

來源

2016-05-22 13:47:22 sgvd

偉大的答案。你是怎麼想到的：「defaultParalellism的值是用spark.default.parallelism設置配置的」?? – bigdatamann

首先通過假設變量的命名和設置不是偶然的;）但通過以下代碼：https：//github.com/apache/spark/blob/master/core/src/main/scala/org/apache /spark/SparkContext.scala#L2321從taskscheduler設置它，這裏https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl。 scala＃L517從後臺獲取它在這裏https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L144獲取它從設置 – sgvd

真棒！非常感謝 – bigdatamann

回答

相關問題