Spark SQL: exclude the current row in a PARTITION BY window function

Here is an example to illustrate my question. In it, we collect, for each user, the list of other products that user has purchased, and attach it to the transaction table as a new column. (Note that we also filter on an arbitrary column, 'good_bad'.)

I would like to know whether Spark SQL supports excluding the CURRENT ROW from a PARTITION BY window function. For example, transaction 1 should have other_purchases = [prod2, prod3] rather than other_purchases = [prod1, prod2, prod3].
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (6, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id",
    "user_id",
    "COLLECT_LIST(CASE WHEN good_bad = 'good' THEN prod_id END) OVER (PARTITION BY user_id) AS other_purchases"
)
df.show()