Getting the difference between a value and its lag in Spark

I have a SparkR DataFrame as shown below. I want to create a monthdiff column that holds the number of months between consecutive dates, grouped by name. How can I do this?

# Load SparkR and start a session (needed before createDataFrame)
library(SparkR)
sparkR.session()
# Set up data frame
team <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
    dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08', '2017-06-08', '2017-07-24', '2017-09-05'))
# Create Spark DataFrame
team <- createDataFrame(team)
# Convert dates to date type
team <- withColumn(team, 'dates', cast(team$dates, 'date'))

This is what I have tried so far; every attempt results in an error:

team <- agg(groupBy(team, 'name'), monthdiff=c(NA, months_between(team$dates, lag(team$dates)))) 
team <- agg(groupBy(team, 'name'), monthdiff=months_between(team$dates, lag(team$dates))) 
team <- agg(groupBy(team, 'name'), monthdiff=months_between(select(team, 'dates'), lag(select(team, 'dates'))))

Expected output:

name   | dates      | monthdiff
-------------------------------
Thomas | 2017-01-05 | NA
Thomas | 2017-02-23 | 1
Thomas | 2017-03-16 | 1
Thomas | 2017-04-08 | 1
Bill   | 2017-06-08 | NA
Bill   | 2017-07-24 | 1
Bill   | 2017-09-05 | 2

Source

2017-08-14 Gaurav Bansal

Based on this post, I adapted the code to SparkR to get the answer.

#Create 'lagdates' variable with lag of dates 
window <- orderBy(windowPartitionBy("name"), team$dates) 
team <- withColumn(team, 'lagdates', over(lag(team$dates), window)) 

#Get months_between dates and lagdates 
team <- withColumn(team, 'monthdiff', round(months_between(team$dates, team$lagdates))) 

name   | dates      | lagdates   | monthdiff
--------------------------------------------
Bill   | 2017-06-08 | null       | null
Bill   | 2017-07-24 | 2017-06-08 | 2
Bill   | 2017-09-05 | 2017-07-24 | 1
Thomas | 2017-01-05 | null       | null
Thomas | 2017-02-23 | 2017-01-05 | 2
Thomas | 2017-03-16 | 2017-02-23 | 1
Thomas | 2017-04-08 | 2017-03-16 | 1
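
Note that these rounded values differ from the expected output in the question (Thomas gets 2 rather than 1 for 2017-02-23, for example). Here is a minimal base R sketch, with no Spark required, of why that probably happens. As I understand the documented rule, months_between on plain dates returns a whole number of months when both dates fall on the same day of the month (or are both month-ends), and otherwise adds a fraction based on 31-day months. The helpers months_between_sketch and calendar_month_diff are hypothetical names used only for illustration; they are not part of SparkR.

```r
# Assumption: this mirrors the date-only rule described for Spark's
# months_between (31-day months, rounded to 8 digits). It is a sketch,
# not Spark's actual implementation.
months_between_sketch <- function(end, start) {
  end <- as.Date(end)
  start <- as.Date(start)
  e <- as.POSIXlt(end)
  s <- as.POSIXlt(start)
  # Whole calendar months between the two dates
  whole <- (e$year - s$year) * 12 + (e$mon - s$mon)
  # TRUE when the following day is the 1st, i.e. the date is a month-end
  last_day <- function(d) format(d + 1, "%d") == "01"
  same_point <- (e$mday == s$mday) | (last_day(end) & last_day(start))
  ifelse(same_point, whole, round(whole + (e$mday - s$mday) / 31, 8))
}

# Difference in calendar month numbers only, ignoring the day of month
calendar_month_diff <- function(end, start) {
  e <- as.POSIXlt(as.Date(end))
  s <- as.POSIXlt(as.Date(start))
  (e$year - s$year) * 12 + (e$mon - s$mon)
}

# Thomas's dates from the question, with a manual one-row lag
dates <- as.Date(c("2017-01-05", "2017-02-23", "2017-03-16", "2017-04-08"))
lagdates <- c(as.Date(NA), head(dates, -1))

raw <- months_between_sketch(dates, lagdates)
print(raw)                                   # NA 1.58064516 0.77419355 0.74193548
print(round(raw))                            # NA 2 1 1, as in the answer's table
print(calendar_month_diff(dates, lagdates))  # NA 1 1 1, as in the question
```

If the calendar-month difference is what is wanted, the same arithmetic could presumably be written on Spark columns with SparkR's year and month functions in place of round(months_between(...)). That is an untested suggestion.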

Source

2017-08-14 17:36:24
