Random partitioner behaviour on joined RDD

I am trying to join two datasets, one of type (Id, SalesRecord) and the other of type (Id, Name). The first dataset is partitioned with a HashPartitioner and the second with a custom partitioner. When I join these RDDs by id and check which partitioner is retained, the result looks random: sometimes the joined RDD reports the custom partitioner and sometimes the HashPartitioner. I also get different results when I change the number of partitions.
According to the Learning Spark book, rdd1.join(rdd2) retains rdd1's partitioner.
Here is the code:
val hashPartitionedRDD = cusotmerIDSalesRecord.partitionBy(new HashPartitioner(10))
println("hashPartitionedRDD's partitioner " + hashPartitionedRDD.partitioner) // Seeing instance of HashPartitioner
val customPartitionedRDD = customerIdNamePair1.partitionBy(new CustomerPartitioner)
println("customPartitionedRDD partitioner " + customPartitionedRDD.partitioner) // Seeing instance of CustomPartitioner
val expectedHash = hashPartitionedRDD.join(customPartitionedRDD)
val expectedCustom = customPartitionedRDD.join(hashPartitionedRDD)
println("Expected Hash " + expectedHash.partitioner) // Seeing instance of Custom Partitioner
println("Expected Custom " + expectedCustom.partitioner) // Seeing instance of CustomPartitioner
// Just to add more to it: when I make the number of partitions of both
// datasets equal, I see the reverse results, i.e.
// expectedHash shows a CustomPartitioner instance and
// expectedCustom shows a HashPartitioner instance.
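For reference, the partitioner that join() ends up with is chosen by Spark's Partitioner.defaultPartitioner. Below is a minimal pure-Scala sketch of that selection rule, modelled on the Spark 1.x/2.x sources (FakeRdd and chosenPartitioner are illustrative names, not Spark API, and newer Spark versions add further checks):

```scala
// Sketch of the rule in Partitioner.defaultPartitioner: candidates are sorted
// by partition count (ascending) and the list is reversed, so the partitioner
// of the RDD with the MOST partitions wins. Because sortBy is stable, on a tie
// the SECOND operand ends up first after the reverse -- which would explain
// the "swapped" results when both datasets have equal partition counts.
case class FakeRdd(numPartitions: Int, partitioner: Option[String])

def chosenPartitioner(rdd: FakeRdd, others: FakeRdd*): Option[String] = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.numPartitions).reverse
  bySize.flatMap(_.partitioner).headOption
}

val hash10   = FakeRdd(10, Some("HashPartitioner"))
val custom20 = FakeRdd(20, Some("CustomerPartitioner"))
val custom10 = FakeRdd(10, Some("CustomerPartitioner"))

// Unequal counts: the larger side's partitioner wins, regardless of order.
println(chosenPartitioner(hash10, custom20)) // Some(CustomerPartitioner)
println(chosenPartitioner(custom20, hash10)) // Some(CustomerPartitioner)

// Equal counts: the tie-break flips to the second operand.
println(chosenPartitioner(hash10, custom10)) // Some(CustomerPartitioner)
println(chosenPartitioner(custom10, hash10)) // Some(HashPartitioner)
```

Note that Spark 2.3+ tightened defaultPartitioner (it can ignore an existing partitioner whose partition count is much smaller than the default parallelism), so the exact behaviour depends on your Spark version.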
Thanks @Mohitt. Yes, but when both datasets have the same number of partitions, I see the reverse results. – java_enthu
@java_enthu: The answer has been updated. – Mohitt