Filtering a Spark DataFrame by groupBy count

A DataFrame A_df looks like this:

+------+----+-----+ 
| uid|year|month| 
+------+----+-----+ 
|  1|2017| 03| 
|  1|2017| 05| 
|  2|2017| 01| 
|  3|2017| 02| 
|  3|2017| 04| 
|  3|2017| 05| 
+------+----+-----+

I want to keep only the rows whose uid occurs more than 2 times. Expected result:

+------+----+-----+ 
| uid|year|month| 
+------+----+-----+ 
|  3|2017| 02| 
|  3|2017| 04| 
|  3|2017| 05| 
+------+----+-----+

How can I get this result in Scala? My solution:

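import org.apache.spark.sql.functions.count
// uids that occur more than twice
val condition_uid = A_df.groupBy("uid") 
        .agg(count("*").alias("cnt")) 
        .filter("cnt > 2").select("uid") 
// inner join back to A_df keeps only the rows with those uids
val results_df = A_df.join(condition_uid, Seq("uid"))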

Is there a better way?
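
One possible variation on the join approach, shown only as a sketch: a left semi join keeps the rows of A_df that have a matching uid and returns only A_df's columns. The local SparkSession, the app name, and the column types of the sample data (integer uid and year, string month) are assumptions made for illustration; they are not stated in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("uid-count-filter").getOrCreate()
import spark.implicits._

// Sample rows copied from the question (month kept as a string to preserve "03").
val A_df = Seq(
  (1, 2017, "03"), (1, 2017, "05"),
  (2, 2017, "01"),
  (3, 2017, "02"), (3, 2017, "04"), (3, 2017, "05")
).toDF("uid", "year", "month")

// uids that occur more than twice
val condition_uid = A_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter($"cnt" > 2)
  .select("uid")

// left_semi: keep rows of A_df with a match in condition_uid, output A_df's columns only
val results_df = A_df.join(condition_uid, Seq("uid"), "left_semi")
results_df.show()
```

This still runs an aggregation plus a join, so it is a restatement of the same idea rather than a fundamentally different plan.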

Source

2017-08-15 wyb

If the answer helped you, you can accept it as the answer :)

I think a window function is the best fit here, because you don't have to join the DataFrame back to itself.

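import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
// $"count" needs import spark.implicits._ (automatic in spark-shell)
// Partition only: adding orderBy would turn the count into a running count
val window = Window.partitionBy("uid") 

A_df.withColumn("count", count("uid").over(window)) 
    .filter($"count" > 2).drop("count").show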

Output:

+---+----+-----+ 
|uid|year|month| 
+---+----+-----+ 
|  3|2017|   02| 
|  3|2017|   04| 
|  3|2017|   05| 
+---+----+-----+
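
A note on the window definition, with a small sketch. When a window spec has an ordering, Spark's default frame runs from the start of the partition to the current row (including ties in the ordering column), so an aggregate over it is a running value; without an ordering the frame is the whole partition. The example below compares the two on the question's data. The local SparkSession, the column types, and the names `whole`, `running`, `counted`, `kept`, and `runningOnly` are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("window-frame-demo").getOrCreate()
import spark.implicits._

// Sample rows copied from the question (month kept as a string to preserve "03").
val A_df = Seq(
  (1, 2017, "03"), (1, 2017, "05"),
  (2, 2017, "01"),
  (3, 2017, "02"), (3, 2017, "04"), (3, 2017, "05")
).toDF("uid", "year", "month")

// No orderBy: the frame is the whole partition, so each row sees the full per-uid count.
val whole = Window.partitionBy("uid")
// With orderBy: the default frame ends at the current row, so the count grows row by row.
val running = Window.partitionBy("uid").orderBy("month")

val counted = A_df
  .withColumn("total", count("uid").over(whole))
  .withColumn("running", count("uid").over(running))
counted.orderBy("uid", "month").show()

// Filtering on the whole-partition count keeps every row of a uid that occurs more than twice.
val kept = counted.filter($"total" > 2).drop("total", "running")
// Filtering on the running count keeps only the rows after the second occurrence.
val runningOnly = counted.filter($"running" > 2)
```

On this data, ordering by year would happen to give the same result as no ordering, because every row has the same year and tied rows share a frame. Ordering by a column that differs between rows, such as month here, exposes the difference.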

Source

2017-08-15 11:34:20
