How can I get the last value of each group in a SparkR DataFrame? My data looks like this:
#Create the R data.frame
custId <- c(rep(1001, 5), rep(1002, 3), 1003)
date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01')
desc <- c('New','New','Good','New','Bad','New','Good','Good','New')
newcust <- c(1,1,0,1,0,1,0,0,1)
df <- data.frame(custId, date, desc, newcust)

#Convert it to a SparkR DataFrame (assumes an active SparkR session)
df <- createDataFrame(df)
display(df)
custId| date | desc | newcust
--------------------------------------
1001 | 2013-08-01| New | 1
1001 | 2014-01-01| New | 1
1001 | 2014-02-01| Good | 0
1001 | 2014-03-01| New | 1
1001 | 2014-04-01| Bad | 0
1002 | 2014-02-01| New | 1
1002 | 2014-03-01| Good | 0
1002 | 2014-04-01| Good | 0
1003 | 2014-04-01| New | 1
newcust indicates a new customer: it is 1 every time a new custId appears, or whenever the same custId's desc reverts back to 'New'. What I want is the last desc value of each newcust grouping, while keeping the first date of each grouping. Below is the DataFrame I want to end up with. How can I do this in Spark? Either PySpark or SparkR code would work.
#What I want
custId| date | newcust | finaldesc
----------------------------------------------
1001 | 2013-08-01| 1 | New
1001 | 2014-01-01| 1 | Good
1001 | 2014-03-01| 1 | Bad
1002 | 2014-02-01| 1 | Good
1003 | 2014-04-01| 1 | New