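Cumulative array from earlier rows (PySpark dataframe)

A (Python) example will make my question clear. Say I have a Spark dataframe recording which people watched which movies on which dates, as follows: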
movierecord = spark.createDataFrame(
    [("Alice", 1, ["Avatar"]),
     ("Bob", 2, ["Fargo", "Tron"]),
     ("Alice", 4, ["Babe"]),
     ("Alice", 6, ["Avatar", "Airplane"]),
     ("Alice", 7, ["Pulp Fiction"]),
     ("Bob", 9, ["Star Wars"])],
    ["name", "unixdate", "movies"])
The schema and contents of the dataframe defined above look like this:
root
 |-- name: string (nullable = true)
 |-- unixdate: long (nullable = true)
 |-- movies: array (nullable = true)
 |    |-- element: string (containsNull = true)
+-----+--------+------------------+
|name |unixdate|movies            |
+-----+--------+------------------+
|Alice|1       |[Avatar]          |
|Bob  |2       |[Fargo, Tron]     |
|Alice|4       |[Babe]            |
|Alice|6       |[Avatar, Airplane]|
|Alice|7       |[Pulp Fiction]    |
|Bob  |9       |[Star Wars]       |
+-----+--------+------------------+
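From the above, I'd like to generate a new dataframe column that contains all the movies each user has previously seen, without duplicates ("previous" meaning rows with an earlier unixdate for the same user). So it should look like this: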
+-----+--------+------------------+------------------------+
|name |unixdate|movies            |previous_movies         |
+-----+--------+------------------+------------------------+
|Alice|1       |[Avatar]          |[]                      |
|Bob  |2       |[Fargo, Tron]     |[]                      |
|Alice|4       |[Babe]            |[Avatar]                |
|Alice|6       |[Avatar, Airplane]|[Avatar, Babe]          |
|Alice|7       |[Pulp Fiction]    |[Avatar, Babe, Airplane]|
|Bob  |9       |[Star Wars]       |[Fargo, Tron]           |
+-----+--------+------------------+------------------------+
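To pin down the semantics, here is a minimal plain-Python sketch (no Spark, and not meant to be efficient) that produces the previous_movies column shown above from the same rows. The function name add_previous_movies is only illustrative, and the sketch assumes that "previous" means rows for the same name sorted by unixdate, with movies kept in first-seen order.

```python
from collections import defaultdict

# Same rows as in the createDataFrame call: (name, unixdate, movies)
rows = [
    ("Alice", 1, ["Avatar"]),
    ("Bob", 2, ["Fargo", "Tron"]),
    ("Alice", 4, ["Babe"]),
    ("Alice", 6, ["Avatar", "Airplane"]),
    ("Alice", 7, ["Pulp Fiction"]),
    ("Bob", 9, ["Star Wars"]),
]


def add_previous_movies(rows):
    """Return (name, unixdate, movies, previous_movies) tuples in input order.

    Hypothetical helper: a reference for the expected result, not a Spark solution.
    """
    seen = defaultdict(list)  # per-user movies seen so far, in first-seen order
    previous = {}             # row index -> snapshot of movies seen before that row
    # Visit rows in unixdate order so "previous" means strictly earlier rows.
    for idx in sorted(range(len(rows)), key=lambda i: rows[i][1]):
        name, _, movies = rows[idx]
        previous[idx] = list(seen[name])
        for movie in movies:
            if movie not in seen[name]:  # drop duplicates
                seen[name].append(movie)
    return [(n, d, mv, previous[i]) for i, (n, d, mv) in enumerate(rows)]


result = add_previous_movies(rows)
for row in result:
    print(row)
```

What I am after is the equivalent of this, computed in Spark itself rather than by collecting the rows to the driver.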
How can I implement this efficiently?