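Getting the difference between a value and its lag in Spark
I have a SparkR DataFrame, shown below. I want to create a monthdiff column that holds the number of months between consecutive dates, grouped by name. How can I do this?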
#Set up data frame
team <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08', '2017-06-08','2017-07-24','2017-09-05'))
#Create Spark DataFrame
team <- createDataFrame(team)
#Convert dates to date type
team <- withColumn(team, 'dates', cast(team$dates, 'date'))
Here is what I have tried so far; every attempt results in an error:
team <- agg(groupBy(team, 'name'), monthdiff=c(NA, months_between(team$dates, lag(team$dates))))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(team$dates, lag(team$dates)))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(select(team, 'dates'), lag(select(team, 'dates'))))
Expected output:
name | dates | monthdiff
-------------------------------
Thomas |2017-01-05 | NA
Thomas |2017-02-23 | 1
Thomas |2017-03-16 | 1
Thomas |2017-04-08 | 1
Bill |2017-06-08 | NA
Bill |2017-07-24 | 1
Bill |2017-09-05 | 2
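One possible approach, offered as a sketch rather than a definitive answer: lag is a window function, so it likely needs to be evaluated with over() on a window partitioned by name and ordered by dates, instead of inside agg(). The names ws, prev, month_idx, monthdiff_frac and result below are illustrative and do not come from the question. Note that months_between returns fractional months, so the sketch also computes a calendar-month difference (year * 12 + month), which appears to be what the expected table shows.

```r
library(SparkR)
sparkR.session()

# Same sample data as in the question
team <- data.frame(
  name = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
  dates = c("2017-01-05", "2017-02-23", "2017-03-16", "2017-04-08",
            "2017-06-08", "2017-07-24", "2017-09-05"),
  stringsAsFactors = FALSE
)
team <- createDataFrame(team)
team <- withColumn(team, "dates", cast(team$dates, "date"))

# Window: one partition per name, rows ordered by date within each partition
ws <- orderBy(windowPartitionBy("name"), "dates")

# Previous date within each name; null (NA after collect) on a group's first row
prev <- over(lag(team$dates, 1), ws)

# Fractional months between each date and the previous one
team <- withColumn(team, "monthdiff_frac", months_between(team$dates, prev))

# Calendar-month difference: compare year * 12 + month with its lagged value
month_idx <- year(team$dates) * 12 + month(team$dates)
team <- withColumn(team, "monthdiff", month_idx - over(lag(month_idx, 1), ws))

# Bring the result back to R, sorted for readability
result <- collect(arrange(team, "name", "dates"))
print(result)
```

Which of the two columns is appropriate depends on whether partial months should count; monthdiff_frac can be rounded or floored if a whole number based on elapsed time is preferred.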