Create a model to represent the data (you could use tuples as well, but code built on tuples quickly gets ugly; it is always better to have named fields):
case class DataItem(key: Int, value: String, timeInMillis: Long)
Then parse the data (you can use Joda-Time's DateTimeFormat to parse the date/time) and create your RDD:
val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
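For the Joda-Time parsing mentioned above, a minimal sketch could look like the following; the CSV-like input layout and the format pattern are assumptions made purely for illustration:

import org.joda.time.format.DateTimeFormat

// Assumed pattern; adjust it to whatever your raw timestamps actually look like.
val formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")

// Turn a raw line such as "1,A,2016-01-01 10:00:00" into a DataItem.
def parseLine(line: String): DataItem = {
  val Array(key, value, ts) = line.split(",")
  DataItem(key.toInt, value, formatter.parseDateTime(ts).getMillis)
}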
Then, as the final step, groupBy key and sortBy time:
rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
Scala REPL:
scala> case class DataItem(key: Int, value: String, timeInMillis: Long)
defined class DataItem
scala> sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
res10: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[12] at parallelize at <console>:36
scala> val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
rdd: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[13] at parallelize at <console>:35
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
res11: org.apache.spark.rdd.RDD[(Int, List[DataItem])] = MapPartitionsRDD[16] at map at <console>:38
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.foreach(println)
(1,List(DataItem(1,A,123)))
(2,List(DataItem(2,B,1234), DataItem(2,C,12345)))
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.map { case (k, v) => (k, v.map(_.value)) }.foreach(println)
(1,List(A))
(2,List(B, C))
You should try it yourself before asking the question. – pamu