// create an RDD of (session, action) pairs
scala> val data = sc.parallelize(List(("sess-1","read"), ("sess-1","meet"),
("sess-1","walk"), ("sess-2","watch"), ("sess-2","sleep"),
("sess-2","run"), ("sess-2","drive")))
data: org.apache.spark.rdd.RDD[(String, String)] =
ParallelCollectionRDD[211] at parallelize at <console>:26

// groupByKey returns an Iterable[String] (a CompactBuffer) per key
scala> val dataCB = data.groupByKey()
dataCB: org.apache.spark.rdd.RDD[(String, Iterable[String])] =
ShuffledRDD[212] at groupByKey at <console>:30

// map each CompactBuffer to a List
scala> val tx = dataCB.map{case (col1, col2) => (col1, col2.toList)}.collect
tx: Array[(String, List[String])] = Array((sess-1,List(read, meet,
walk)), (sess-2,List(watch, sleep, run, drive)))
// groupByKey and the map to List can also be achieved in one statement
scala> val dataCB = data.groupByKey().map{case (col1,col2)
=> (col1,col2.toList)}.collect
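For reference, the shape of this transformation can be sketched with plain Scala collections, with no SparkContext needed. This is only an illustration of what groupByKey plus toList produces, not Spark code; `groupBy` here runs locally, whereas `groupByKey` shuffles data across the cluster:

```scala
// Plain-Scala sketch of the groupByKey + toList pattern above.
// groupBy collects the pairs per key, preserving their original order;
// mapValues(_.map(_._2)) keeps only the action strings, mirroring the RDD result.
object GroupDemo {
  def main(args: Array[String]): Unit = {
    val data = List(("sess-1","read"), ("sess-1","meet"), ("sess-1","walk"),
                    ("sess-2","watch"), ("sess-2","sleep"),
                    ("sess-2","run"), ("sess-2","drive"))
    // sortBy is only for deterministic printing; Map key order is unspecified
    val tx = data.groupBy(_._1).mapValues(_.map(_._2)).toList.sortBy(_._1)
    println(tx)
  }
}
```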
Thanks for your reply... my question is a bit different; I found this out by doing some R&D, which I am posting below –