How to use countDistinct in Spark/Scala?

I am trying to aggregate a column of a Spark DataFrame in Scala, like this:

import org.apache.spark.sql._ 

dfNew.agg(countDistinct("filtered"))

but I get the error:

error: value agg is not a member of Unit

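Can anyone explain why?

Edit: to clarify what I am doing: I have a column of string arrays, and I want to count the distinct elements across all rows. I am not interested in the other columns. Data: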

+------+----------------------------------------------------------------------------------------+
|racist|filtered                                                                                |
+------+----------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, , https://time.com/sxp3onz1w8]|
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]      |

I want to count the elements of filtered, giving:

rt:2, @dope_promo:1, crew:1, ...frog:2 etc

2017-07-03 schoon

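For aggregate functions, you need to apply groupBy first. This may help: https://stackoverflow.com/questions/33500816/how-to-use-countdistinct-in-scala-with-spark

Possible duplicate of [How to use countDistinct in Scala with Spark?](https://stackoverflow.com/questions/33500816/how-to-use-countdistinct-in-scala-with-spark)

OK, maybe I am trying to use the wrong function. I have a column that is an array of strings, and I want to count the distinct elements across all rows. I am not interested in the other columns. I will edit my question to reflect this. – schoon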

You need to explode your array first before you can count occurrences. To see the count of each element:

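import org.apache.spark.sql.functions.explode
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark

dfNew
  .withColumn("filtered", explode($"filtered")) // one output row per array element
  .groupBy($"filtered")
  .count()
  .orderBy($"count".desc) // most frequent elements first
  .show()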
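For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same explode-then-group approach. The two sample rows are a shortened, hypothetical version of the data in the question, and the names spark, token and counts are mine, not the asker's. It assumes Spark is on the classpath and that the code runs as a Scala script (in spark-shell the session and its implicits already exist, so those lines can be skipped).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for experimentation only (assumption: not a cluster setup).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-count-sketch")
  .getOrCreate()
import spark.implicits._ // needed for toDF on a local Seq

// Hypothetical, shortened stand-in for the DataFrame in the question.
val dfNew = Seq(
  (false, Seq("rt", "@dope_promo:", "crew", "beat", "high", "scores", "fugly", "frog")),
  (false, Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog", "lizard?"))
).toDF("racist", "filtered")

// explode yields one row per array element; groupBy + count tallies each element.
val perElement = dfNew
  .select(explode(col("filtered")).as("token"))
  .groupBy("token")
  .count()

// Collect to the driver as a Map; only sensible when the result is small.
val counts: Map[String, Long] = perElement
  .collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap

println(counts("rt"))   // 2
println(counts("frog")) // 2
```

Collecting into a Map is only for inspecting a small result. On real data you would probably keep the result as a DataFrame and call show or write it out.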

Or, to get just the number of distinct elements:

val count = dfNew
  .withColumn("filtered", explode($"filtered"))
  .select($"filtered")
  .distinct()
  .count()
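If you would rather use countDistinct itself, as in the question, it can be applied after the explode. This is a sketch under some assumptions. countDistinct is defined in org.apache.spark.sql.functions, which the question's `import org.apache.spark.sql._` does not bring into scope, and agg can be called directly on a DataFrame. The "not a member of Unit" error suggests that dfNew was bound to the result of a call that returns Unit (show, for example), but that is a guess, since the question does not show how dfNew was built. The sample rows and the names spark, token and distinctTokens are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, explode}

// Local session for experimentation only (assumption: not a cluster setup).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("count-distinct-sketch")
  .getOrCreate()
import spark.implicits._ // needed for toDF on a local Seq

// Keep the DataFrame itself in the val. Something like
//   val dfNew = someDf.show()
// would make dfNew a Unit, which has no agg method.
val dfNew = Seq(
  (false, Seq("rt", "@dope_promo:", "crew", "beat", "high", "scores", "fugly", "frog")),
  (false, Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog", "lizard?"))
).toDF("racist", "filtered")

// Flatten the arrays, then aggregate over the whole DataFrame with countDistinct.
val distinctTokens: Long = dfNew
  .select(explode(col("filtered")).as("token"))
  .agg(countDistinct(col("token")))
  .head()
  .getLong(0)

println(distinctTokens) // 13 for the sample rows above ("rt" and "frog" appear in both)
```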

2017-07-03 19:18:40
