Spark: group by ID
I have two RDDs. In Spark Scala, how can I join event1001RDD and event2009RDD where they have the same id?
val event1001RDD: schemaRDD = [eventtype, id, location, date1]
[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]
val event2009RDD: schemaRDD = [eventtype, id, date1, date2]
[2009,4929101,2015-01-20 20:44:39,2015-01-20 20:44:39]
[2009,4929102,2015-01-20 15:44:39,2015-01-20 21:44:39]
[2009,4929103,2015-01-20 14:44:39,2015-01-20 14:44:39]
[2009,4929105,2015-01-20 20:44:39,2015-01-20 20:44:39]
The expected result would be (unique, sorted by id):
[eventtype,id,1001's location,1001's date1,2009's date1,2009's date2]
2009,4929101,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39
1001,4929102,LOC01,2015-01-20 10:44:39,2015-01-20 15:44:39,2015-01-20 21:44:39
1001,4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,2015-01-20 14:44:39
1001,4929104,LOC03,2015-01-20 10:44:39,NULL,NULL
2009,4929105,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39
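Note that for id 4929102, 1001 is used as the eventtype. The 2009 eventtype should only be used when there is no matching id in 1001.
The result can be a flat RDD[String], or an RDD of tuples obtained through aggregateByKey. I only need to iterate over the RDD.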
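The expected output above has the shape of a full outer join on id, with 1001 taking precedence as the event type. Below is a minimal sketch of that merge rule in plain Scala collections, not a definitive Spark implementation. The case classes `Event1001` and `Event2009` and the helpers `mergeRow` and `joinEvents` are hypothetical names introduced only for illustration. On pair RDDs keyed by id, `fullOuterJoin` (in Spark versions that provide it) yields the same `(id, (Option[...], Option[...]))` shape, so `mergeRow` could be applied in a `map` after it.

```scala
// Hypothetical row types; field names follow the schemas in the question.
case class Event1001(id: Long, location: String, date1: String)
case class Event2009(id: Long, date1: String, date2: String)

object JoinEventsSketch {
  // Build one output row from the two optional sides of a full outer join.
  // 1001 is used as the event type whenever a 1001 row exists for the id;
  // 2009 is used only when there is no 1001 match.
  def mergeRow(id: Long, a: Option[Event1001], b: Option[Event2009]): String = {
    val eventType = if (a.isDefined) 1001 else 2009
    val location  = a.map(_.location).getOrElse("NULL")
    val date1001  = a.map(_.date1).getOrElse("NULL")
    val date1     = b.map(_.date1).getOrElse("NULL")
    val date2     = b.map(_.date2).getOrElse("NULL")
    Seq(eventType, id, location, date1001, date1, date2).mkString(",")
  }

  // Full outer join on id, emulated with Maps, sorted by id.
  // Assumption: at most one row per id on each side (toMap keeps the last one).
  def joinEvents(e1001: Seq[Event1001], e2009: Seq[Event2009]): Seq[String] = {
    val left  = e1001.map(e => e.id -> e).toMap
    val right = e2009.map(e => e.id -> e).toMap
    (left.keySet ++ right.keySet).toSeq.sorted
      .map(id => mergeRow(id, left.get(id), right.get(id)))
  }

  def main(args: Array[String]): Unit = {
    val e1001 = Seq(
      Event1001(4929102L, "LOC01", "2015-01-20 10:44:39"),
      Event1001(4929103L, "LOC02", "2015-01-20 10:44:39"),
      Event1001(4929104L, "LOC03", "2015-01-20 10:44:39"))
    val e2009 = Seq(
      Event2009(4929101L, "2015-01-20 20:44:39", "2015-01-20 20:44:39"),
      Event2009(4929102L, "2015-01-20 15:44:39", "2015-01-20 21:44:39"),
      Event2009(4929103L, "2015-01-20 14:44:39", "2015-01-20 14:44:39"),
      Event2009(4929105L, "2015-01-20 20:44:39", "2015-01-20 20:44:39"))
    joinEvents(e1001, e2009).foreach(println)
  }
}
```

Keeping the precedence rule in one small function means the same logic works whether the optional pairs come from a collection, a pair-RDD join, or rows of a SQL full outer join.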
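This is exactly what I need. Thanks ayan! :) – sophie
Hi ayan, I need to update the SQL because I now have 3 RDDs. Could you take a look? Thanks http://stackoverflow.com/questions/30472975/spark-group-rdd-sql-query – sophie
Answered, mate .... please feel free to let me know whether it works (or not) –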