Given the dataset below, how do I select rows by max(date) with the Spark DataFrame API?
id v date
1 a1 1
1 a2 2
2 b1 3
2 b2 4
I need to select only the last value (with respect to date) for each id.
I came up with this code:
scala> val df = sc.parallelize(List((1, "a1", 1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4))).toDF("id", "v", "date")
df: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int]
scala> val agg = df.groupBy("id").max("date")
agg: org.apache.spark.sql.DataFrame = [id: int, max(date): int]
scala> val res = df.join(agg, df("id") === agg("id") && df("date") === agg("max(date)"))
16/11/14 22:25:01 WARN sql.Column: Constructing trivially true equals predicate, 'id#3 = id#3'. Perhaps you need to use aliases.
res: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int, id: int, max(date): int]
Is there a better (more idiomatic, ...) way to do this?
Bonus: how can I take the max over an actual date column and avoid the error `Aggregation function can only be applied on a numeric column.`?
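As the warning itself hints ("Perhaps you need to use aliases"), the trivially-true-predicate issue goes away if both sides of the join are aliased so each column reference is unambiguous. A minimal sketch (assuming Spark 2.x with `SparkSession`; the column name `max_date` is my own choice):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object MaxDatePerIdJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("max-date-per-id")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a1", 1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4))
      .toDF("id", "v", "date")

    // Name the aggregated column explicitly and alias both frames,
    // so the join condition cannot collapse to `id#3 = id#3`.
    val agg = df.groupBy("id").agg(max("date").as("max_date")).as("agg")

    val res = df.as("d")
      .join(agg, $"d.id" === $"agg.id" && $"d.date" === $"agg.max_date")
      .select($"d.id", $"d.v", $"d.date")

    res.show()
    spark.stop()
  }
}
```

With the sample data this keeps one row per id: (1, "a2", 2) and (2, "b2", 4).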
You can try applying the `from_unixtime` function to the date field of `agg`. – Shankar
I'm not sure whether this is OK for you, but it's worth trying in SQL: `select max(date) as mdate, id from tmp_table group by id;` – evgenii
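A commonly suggested alternative for "last row per group" (not from this thread, so treat it as a sketch) is a window function, which also sidesteps the self-join and works on non-numeric columns such as dates, since it orders rather than aggregates:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object MaxDatePerIdWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("max-date-per-id-window")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a1", 1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4))
      .toDF("id", "v", "date")

    // Rank rows within each id by descending date; rank 1 is the latest.
    val w = Window.partitionBy("id").orderBy($"date".desc)

    val res = df.withColumn("rn", row_number().over(w))
      .filter($"rn" === 1)
      .drop("rn")

    res.show()
    spark.stop()
  }
}
```

Because the ordering column is only compared, not summed, the same pattern applies unchanged when `date` is a `DateType` or `TimestampType` column.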