Spark SQL: aggregate column values within a group

I need to aggregate the values of the column articleId into an array. This has to happen within each of the groups created by the groupBy I apply beforehand.

My table looks like this:
| customerId | articleId | articleText | ...
| 1 | 1 | ... | ...
| 1 | 2 | ... | ...
| 2 | 1 | ... | ...
| 2 | 2 | ... | ...
| 2 | 3 | ... | ...
I want to build something like:
| customerId | articleIds |
| 1 | [1, 2] |
| 2 | [1, 2, 3] |
My code so far:
DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));
But here I get an AnalysisException:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Can someone help me build a correct statement?
Are you using a 'SQLContext' or a 'HiveContext'? –
I am using a SQLContext ... –
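For reference, a minimal sketch of one way to do this, using collect_list from org.apache.spark.sql.functions (the variable name test and the alias articleIds follow the question; note that in Spark 1.x collect_list is backed by a Hive UDAF and typically needs a HiveContext, while from Spark 2.0 onward it works without Hive):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.collect_list;

// collect_list is an aggregate function, so it is legal inside agg():
// it gathers all ARTICLEID values of each CUSTOMERID group into an array.
DataFrame test = dfFiltered
    .groupBy("CUSTOMERID")
    .agg(collect_list(dfFiltered.col("ARTICLEID")).alias("articleIds"));

test.show();
// +----------+----------+
// |CUSTOMERID|articleIds|
// +----------+----------+
// |         1|    [1, 2]|
// |         2| [1, 2, 3]|
// +----------+----------+

The original agg(dfFiltered.col("ARTICLEID")) fails because a bare column reference is neither part of the group-by key nor an aggregate function, which is exactly what the AnalysisException message says; wrapping the column in an aggregate such as collect_list resolves it.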