Spark SQL: aggregating column values within a group

I need to aggregate the values of the articleId column into an array. This has to happen within each group that I create beforehand with groupBy.

My table looks like this:

| customerId | articleId | articleText | ... 
| 1  |  1  | ...  | ... 
| 1  |  2  | ...  | ... 
| 2  |  1  | ...  | ... 
| 2  |  2  | ...  | ... 
| 2  |  3  | ...  | ...

I would like to build something like this:

| customerId | articleIds | 
| 1  | [1, 2]  | 
| 2  | [1, 2, 3] |

My code so far:

DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));

But this gives me an AnalysisException:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

Can someone help me build a correct statement?

Source

2016-07-11 D. Müller

Are you using SQLContext or HiveContext?

I am using SQLContext ...

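In SQL syntax, when you group by something, you have to include that "something" in the select statement. Perhaps your Spark SQL code does not do that.

There is a similar question, so I think the solution to your problem is: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function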

Source

2016-07-11 10:29:10

This can be done with the collect_list function, but it is only available if you are using HiveContext:

import org.apache.spark.sql.functions._ 

df.groupBy("customerId").agg(collect_list("articleId"))
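
To make the semantics of that grouping concrete, here is a minimal plain-Java sketch of what "group by customerId, then collect articleId into a list" computes on the sample table from the question. It deliberately does not use Spark: the class name GroupCollectSketch, the method collectArticleIds, and the use of int[] pairs as rows are all assumptions made for illustration. In Spark's Java API the equivalent aggregation would presumably go through org.apache.spark.sql.functions.collect_list inside agg(...), as in the Scala line above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupCollectSketch {

    // Hypothetical stand-in for a table row: {customerId, articleId}.
    // Groups rows by customerId and collects the articleIds of each group
    // into a list, keeping duplicates and encounter order.
    public static Map<Integer, List<Integer>> collectArticleIds(List<int[]> rows) {
        // LinkedHashMap keeps customers in first-seen order for readable output.
        Map<Integer, List<Integer>> grouped = new LinkedHashMap<>();
        for (int[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The sample data from the question.
        List<int[]> rows = Arrays.asList(
            new int[] {1, 1},
            new int[] {1, 2},
            new int[] {2, 1},
            new int[] {2, 2},
            new int[] {2, 3});
        System.out.println(collectArticleIds(rows)); // prints {1=[1, 2], 2=[1, 2, 3]}
    }
}
```

Note that this in-memory version preserves row order, whereas a distributed collect_list generally makes no guarantee about the order of elements within each group.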

Source

2016-07-11 10:58:19
