2017-09-26 28 views
0

Joining (key, value) pairs with (key, Map) pairs

I have this dataset:

(apple,1) 
(banana,4) 
(orange,3) 
(grape,2) 
(watermelon,2) 

and the other dataset is:

(apple,Map(Bob -> 1)) 
(banana,Map(Chris -> 1)) 
(orange,Map(John -> 1)) 
(grape,Map(Smith -> 1)) 
(watermelon,Map(Phil -> 1)) 

I am aiming to combine the two sets to get:

(apple,1,Map(Bob -> 1)) 
(banana,4,Map(Chris -> 1)) 
(orange,3,Map(John -> 1)) 
(grape,2,Map(Smith -> 1)) 
(watermelon,2,Map(Phil -> 1)) 

My code for the first dataset:

... 
val counts_firstDataset = words.map(word => 
(word.firstWord, 1)).reduceByKey{case (x, y) => x + y} 

And for the second dataset:

... 
val counts_secondDataset = secondSet.map(x => (x._1, 
x._2.toList.groupBy(identity).mapValues(_.size))) 

I tried the join method, val joined_data = counts_firstDataset.join(counts_secondDataset), but it did not work because join requires a pair of [K, V]. How can I solve this?

+0

@philantrovert RDDs –

+1

Got it. I should have read the question all the way through. – philantrovert

+0

What data structure are you using to store these datasets? List, Set, etc.? – fcat

Answers

1

The simplest approach is to convert both datasets to DataFrames and then join them:

import spark.implicits._ 
val counts_firstDataset = words 
    .map(word => (word.firstWord, 1)) 
    .reduceByKey{case (x, y) => x + y} 
    .toDF("type", "value") 

val counts_secondDataset = secondSet 
    .map(x => (x._1,x._2.toList.groupBy(identity).mapValues(_.size))) 
    .toDF("type_2","map") 

counts_firstDataset 
    .join(counts_secondDataset, 'type === 'type_2) 
    .drop('type_2) 
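
To try the DataFrame approach in isolation, here is a minimal self-contained sketch. It assumes a local SparkSession and uses hypothetical sample data in place of words and secondSet, which the question does not show; the object name DataFrameJoinSketch and the helper run are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameJoinSketch {
  // Builds two small DataFrames, joins them on the fruit name,
  // and returns the joined rows sorted by key.
  def run(): List[(String, Int, Map[String, Int])] = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dataframe-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for counts_firstDataset.
    val first = spark.sparkContext
      .parallelize(Seq(("apple", 1), ("banana", 4), ("orange", 3)))
      .toDF("type", "value")

    // Hypothetical stand-in for counts_secondDataset.
    val second = spark.sparkContext
      .parallelize(Seq(
        ("apple", Map("Bob" -> 1)),
        ("banana", Map("Chris" -> 1)),
        ("orange", Map("John" -> 1))))
      .toDF("type_2", "map")

    // Join on the key columns, then drop the duplicated key.
    val joined = first
      .join(second, $"type" === $"type_2")
      .drop("type_2")

    // Tuple columns are matched by position: (type, value, map).
    val rows = joined
      .as[(String, Int, Map[String, Int])]
      .collect()
      .toList
      .sortBy(_._1)

    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The result is sorted after collecting because a join does not guarantee any particular row order.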
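
1

Provided both RDDs list their first elements (the fruit names) in the same order, you can pair them up with zip and then use map to flatten each pair into a single tuple. Note that RDD.zip also requires both RDDs to have the same number of partitions and the same number of elements in each partition: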

counts_firstDataset.zip(counts_secondDataset) 
    .map(vk => (vk._1._1, vk._1._2, vk._2._2))
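
The following sketch puts zip next to a plain pair-RDD join, so the two can be compared on the same input. It assumes a local SparkSession and hypothetical sample data; the names PairRddCombineSketch, Combined, and run are invented for illustration. Both RDDs are built with the same number of slices from equally long sequences, which is what makes zip valid here.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PairRddCombineSketch {
  type Combined = (String, Int, Map[String, Int])

  // Returns (result of zip, result of join sorted by key).
  def run(): (List[Combined], List[Combined]) = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("pair-rdd-combine-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data, listed in the same key order.
    val counts = Seq(("apple", 1), ("banana", 4), ("orange", 3), ("grape", 2))
    val names = Seq(
      ("apple", Map("Bob" -> 1)),
      ("banana", Map("Chris" -> 1)),
      ("orange", Map("John" -> 1)),
      ("grape", Map("Smith" -> 1)))

    // Same slice count and same length, so the partitions line up for zip.
    val first: RDD[(String, Int)] = sc.parallelize(counts, 2)
    val second: RDD[(String, Map[String, Int])] = sc.parallelize(names, 2)

    // zip pairs elements by position; the keys are never compared.
    val zipped = first.zip(second).map { case ((k, v), (_, m)) => (k, v, m) }

    // join matches on the key and yields (key, (value, map)); flatten it.
    val joined = first.join(second).map { case (k, (v, m)) => (k, v, m) }

    val result = (zipped.collect().toList, joined.collect().toList.sortBy(_._1))
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = {
    val (zipped, joined) = run()
    zipped.foreach(println)
    joined.foreach(println)
  }
}
```

Since zip relies purely on position, the join variant is the safer choice whenever the ordering is not guaranteed, for example after reduceByKey. If the join in the question failed with a serialization error, one possible cause (an assumption, the question does not include the error message) is that mapValues on a Scala 2.11/2.12 Map returns a view that is not serializable; appending .map(identity) after mapValues materializes it.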