I have the following JSON data, and I want to sum the 'a' and 'b' columns while bucketing the 'timestamp' column by hour. In PySpark, how can I change a column's values before using groupby on that column?
{"a":1 , "b":1, "timestamp":"2017-01-26T01:14:55.719214Z"}
{"a":1 , "b":1,"timestamp":"2017-01-26T01:14:55.719214Z"}
{"a":1 , "b":1,"timestamp":"2017-01-26T02:14:55.719214Z"}
{"a":1 , "b":1,"timestamp":"2017-01-26T03:14:55.719214Z"}
This is the output I want:
{"a":2 , "b":2, "timestamp":"2017-01-26T01:00:00"}
{"a":1 , "b":1,"timestamp":"2017-01-26T02:00:00"}
{"a":1 , "b":1,"timestamp":"2017-01-26T03:00:00"}
This is what I have written so far to produce that output:
import pyspark.sql.functions as f

df = spark.read.json(inputfile)
df2 = df.groupby("timestamp").agg(f.sum(df["a"]), f.sum(df["b"]))
But how should I change the values of the 'timestamp' column before using the groupby function? Thanks in advance!
This [answer](http://stackoverflow.com/a/34232633/2708667) may be helpful. It shows how to round parsed timestamp objects. – santon
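Following that suggestion, here is a minimal sketch (not from the original thread) of one way to do it: since the timestamps are ISO-8601 strings, truncating each string to its hour before grouping yields exactly the desired output. On Spark 2.3+, f.date_trunc("hour", ...) on a column cast to timestamp would be a more robust alternative.

import pyspark.sql.functions as f

df = spark.read.json(inputfile)

# Characters 1-13 of "2017-01-26T01:14:55.719214Z" are "2017-01-26T01";
# appending ":00:00" rounds each timestamp down to the start of its hour.
df2 = (df
    .withColumn("timestamp",
                f.concat(f.substring("timestamp", 1, 13), f.lit(":00:00")))
    .groupby("timestamp")
    .agg(f.sum("a").alias("a"), f.sum("b").alias("b")))

df2.show(truncate=False)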