2017-08-14

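Get the difference between a value and its lag in Spark

I have a SparkR DataFrame, shown below. I want to create a monthdiff column that holds the number of months between each date and the previous date, grouped by name. How can I do this?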

#Load SparkR and start a session
library(SparkR)
sparkR.session()
#Set up data frame
team <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                   dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08', '2017-06-08', '2017-07-24', '2017-09-05'),
                   stringsAsFactors = FALSE)
#Create Spark DataFrame
team <- createDataFrame(team)
#Convert dates to date type
team <- withColumn(team, 'dates', cast(team$dates, 'date'))

Here is what I have tried so far; every attempt results in an error:

team <- agg(groupBy(team, 'name'), monthdiff=c(NA, months_between(team$dates, lag(team$dates)))) 
team <- agg(groupBy(team, 'name'), monthdiff=months_between(team$dates, lag(team$dates))) 
team <- agg(groupBy(team, 'name'), monthdiff=months_between(select(team, 'dates'), lag(select(team, 'dates')))) 

Expected output:

name   | dates      | monthdiff
-------------------------------
Thomas | 2017-01-05 | NA
Thomas | 2017-02-23 | 1
Thomas | 2017-03-16 | 1
Thomas | 2017-04-08 | 1
Bill   | 2017-06-08 | NA
Bill   | 2017-07-24 | 1
Bill   | 2017-09-05 | 2

Answer

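Based on this post, I adapted the code to SparkR to get the answer. The key point is that lag is a window function, so it has to be applied with over() on a window partitioned by name and ordered by dates, rather than inside agg().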

#Create 'lagdates' variable with lag of dates 
window <- orderBy(windowPartitionBy("name"), team$dates) 
team <- withColumn(team, 'lagdates', over(lag(team$dates), window)) 

#Get months_between dates and lagdates 
team <- withColumn(team, 'monthdiff', round(months_between(team$dates, team$lagdates))) 

name   | dates      | lagdates   | monthdiff
--------------------------------------------
Bill   | 2017-06-08 | null       | null
Bill   | 2017-07-24 | 2017-06-08 | 2
Bill   | 2017-09-05 | 2017-07-24 | 1
Thomas | 2017-01-05 | null       | null
Thomas | 2017-02-23 | 2017-01-05 | 2
Thomas | 2017-03-16 | 2017-02-23 | 1
Thomas | 2017-04-08 | 2017-03-16 | 1
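
Note that these values (2, 1, 1 for Thomas) differ from the expected output in the question (1, 1, 1). The reason seems to be that months_between returns a fractional number of months, counting leftover days as parts of a 31-day month, and the result is then rounded. The expected output looks more like a plain difference of calendar month numbers. The following is a minimal local sketch in base R (no Spark needed) that compares the two rules on the same data. The helpers lag_within, date_parts, calendar_diff and spark_like_diff are hypothetical names made up for this illustration, and spark_like_diff is a simplification that ignores the end-of-month special case and any time-of-day component.

```r
# Local copy of the sample data, already sorted by date within each name.
team_local <- data.frame(
  name  = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
  dates = as.Date(c("2017-01-05", "2017-02-23", "2017-03-16", "2017-04-08",
                    "2017-06-08", "2017-07-24", "2017-09-05")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: previous value within each group.
# Assumes rows of the same group are adjacent and sorted.
lag_within <- function(x, group) {
  prev <- seq_along(x) - 1L
  is_first <- c(TRUE, group[-1] != group[-length(group)])
  prev[is_first] <- NA   # first row of each group has no predecessor
  x[prev]
}

# Hypothetical helper: split a Date vector into year, month and day of month.
date_parts <- function(d) {
  lt <- as.POSIXlt(d)
  list(y = lt$year + 1900, m = lt$mon + 1, d = lt$mday)
}

# Difference of calendar month numbers (the rule the expected output suggests).
calendar_diff <- function(d1, d2) {
  a <- date_parts(d1); b <- date_parts(d2)
  (a$y - b$y) * 12 + (a$m - b$m)
}

# Simplified 31-day rule: whole months plus leftover days divided by 31.
# Assumption: no end-of-month special case, no time-of-day component.
spark_like_diff <- function(d1, d2) {
  a <- date_parts(d1); b <- date_parts(d2)
  (a$y - b$y) * 12 + (a$m - b$m) + (a$d - b$d) / 31
}

team_local$lagdates <- lag_within(team_local$dates, team_local$name)
team_local$calendar <- calendar_diff(team_local$dates, team_local$lagdates)
team_local$rounded  <- round(spark_like_diff(team_local$dates, team_local$lagdates))

print(team_local)
```

Here the calendar column gives NA, 1, 1, 1 for Thomas and NA, 1, 2 for Bill, while the rounded column gives NA, 2, 1, 1 and NA, 2, 1, matching the output above. If the calendar-style result is what you want in SparkR itself, one option would be to build the same expression from the year() and month() column functions applied to dates and lagdates instead of calling months_between.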