2017-08-09 62 views

Spark - cumulative timestamp values in minutes

Basically, I need to see the cumulative minutes for a list of timestamp values.

Timestamp    cum 
2017-06-04 02:58:00, 0 
2017-06-04 03:02:00, 4 
2017-06-04 03:05:00, 7 
2017-06-04 03:10:00, 12 

This is the idea I am working with:

from pyspark.sql import Window as W
from pyspark.sql import functions as F
from pyspark.sql.functions import col

windowSpec = W.partitionBy(A["userid"]).orderBy(A["eventtime"])
acumEventTime = F.sum(col("eventtime")).over(windowSpec)
A.select("userid", "eventtime", acumEventTime.alias("acumEventTime"))

I applied sum to the timestamps over a window, and it gives me the following values in the acumEventTime field:

acumEventTime 
2.9930904E9, 
1.4965452E9, 
1.4965452E9, 
1.4965452E9, 
2.9930904E9 

Is there an efficient way to show only the minutes?

Answer


Given the description, I would rather combine lag and sum:

from pyspark.sql.functions import col, coalesce, lag, lit, sum 
from pyspark.sql.window import Window 

df = (spark.createDataFrame([ 
    (1, "2017-06-04 02:58:00"), 
    (1, "2017-06-04 03:02:00"), 
    (1, "2017-06-04 03:05:00"), 
    (1, "2017-06-04 03:10:00"), 
]) 
.toDF("userid", "eventtime") 
.withColumn("eventtime", col("eventtime").cast("timestamp"))) 

w = Window.partitionBy("userid").orderBy("eventtime") 

cum = (sum(coalesce(
    col("eventtime").cast("long") - lag("eventtime", 1).over(w).cast("long"), 
    lit(0) 
)).over(w)/60).cast("long") 

df.withColumn("cum", cum).show() 

+------+-------------------+---+
|userid|          eventtime|cum|
+------+-------------------+---+
|     1|2017-06-04 02:58:00|  0|
|     1|2017-06-04 03:02:00|  4|
|     1|2017-06-04 03:05:00|  7|
|     1|2017-06-04 03:10:00| 12|
+------+-------------------+---+
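
To make the arithmetic easier to follow, here is a minimal plain-Python sketch of the same lag-then-running-sum logic, without Spark. It assumes the timestamps are interpreted as UTC; the helper names `to_epoch_seconds` and `cumulative_minutes` are made up for illustration and are not part of any library. It also hints at why summing the raw column produced values around 1.5E9 in the question: those appear to be epoch seconds rather than minutes.

```python
from datetime import datetime, timezone

# Sample event times from the question, interpreted as UTC (an assumption).
events = [
    "2017-06-04 02:58:00",
    "2017-06-04 03:02:00",
    "2017-06-04 03:05:00",
    "2017-06-04 03:10:00",
]


def to_epoch_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return whole seconds since the epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def cumulative_minutes(timestamps):
    """Running total of the gaps between consecutive timestamps, in whole minutes."""
    seconds = sorted(to_epoch_seconds(t) for t in timestamps)
    result = []
    running = 0
    previous = None
    for current in seconds:
        # Same role as coalesce(current - lag(current), 0): the first row has no predecessor.
        gap = 0 if previous is None else current - previous
        running += gap
        # Convert the running total of seconds to whole minutes.
        result.append(running // 60)
        previous = current
    return result


epochs = [to_epoch_seconds(t) for t in events]
minutes = cumulative_minutes(events)

print(epochs[0])  # epoch seconds for the first event, on the order of 1.5e9
print(minutes)    # cumulative minutes per row
```

Because the gaps are summed in seconds and divided by 60 only at the end, sub-minute remainders are not lost row by row; the Spark expression above does the same by dividing after the windowed sum.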

Works like a charm! – ebertbm
