2014-12-25 94 views
4

Cumulative sum with lag

I have a very large dataset that, simplified, looks like this:

row. member_id entry_id comment_count timestamp
1    1         a        4             2008-06-09 12:41:00
2    1         b        1             2008-07-14 18:41:00
3    1         c        3             2008-07-17 15:40:00
4    2         d        12            2008-06-09 12:41:00
5    2         e        50            2008-09-18 10:22:00
6    3         f        0             2008-10-03 13:36:00

I can compute the cumulative count per member with the following code:

transform(df, aggregated_count = ave(comment_count, member_id, FUN = cumsum)) 

But I want a lag of 1 in the cumulative sum; in other words, I want cumsum to ignore the current row. The result should be:

row. member_id entry_id comment_count timestamp           previous_comments
1    1         a        4             2008-06-09 12:41:00 0
2    1         b        1             2008-07-14 18:41:00 4
3    1         c        3             2008-07-17 15:40:00 5
4    2         d        12            2008-06-09 12:41:00 0
5    2         e        50            2008-09-18 10:22:00 12
6    3         f        0             2008-10-03 13:36:00 0

Any ideas on how to do this in R? Perhaps even with a lag larger than 1?


Data for reproducibility:

# dput(df) 
structure(list(member_id = c(1L, 1L, 1L, 2L, 2L, 3L), entry_id = c("a", 
"b", "c", "d", "e", "f"), comment_count = c(4L, 1L, 3L, 12L, 
50L, 0L), timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00", 
"2008-07-17 15:40:00", "2008-06-09 12:41:00", "2008-09-18 10:22:00", 
"2008-10-03 13:36:00")), .Names = c("member_id", "entry_id", 
"comment_count", "timestamp"), row.names = c("1", "2", "3", "4", 
"5", "6"), class = "data.frame") 
+1

It looks like you have already written out the correct code in a sentence, hint hint :) –

Answers

9

You can use 0 as the first element and drop the last element with head(x, -1):

transform(df, previous_comments=ave(comment_count, member_id, 
      FUN = function(x) cumsum(c(0, head(x, -1))))) 
# member_id entry_id comment_count   timestamp previous_comments 
#1   1  a    4 2008-06-09 12:41:00     0 
#2   1  b    1 2008-07-14 18:41:00     4 
#3   1  c    3 2008-07-17 15:40:00     5 
#4   2  d   12 2008-06-09 12:41:00     0 
#5   2  e   50 2008-09-18 10:22:00    12 
#6   3  f    0 2008-10-03 13:36:00     0 
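
For the lag larger than 1 asked about in the question, the same idea might be generalized along the lines below. This is only a sketch: lag_cumsum is a hypothetical helper name, not a base R function, and it assumes that rows with fewer than k earlier rows in their group should get 0.

```r
# Sample data from the question (timestamp column omitted for brevity)
df <- data.frame(
  member_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  entry_id      = c("a", "b", "c", "d", "e", "f"),
  comment_count = c(4L, 1L, 3L, 12L, 50L, 0L)
)

# Hypothetical helper: cumulative sum shifted down by k rows, padded with 0
lag_cumsum <- function(x, k = 1) {
  n <- length(x)
  kept <- x[seq_len(max(n - k, 0))]   # drop the last k elements
  cumsum(c(rep(0, min(k, n)), kept))  # prepend the zeros, then accumulate
}

# Lag of 1 reproduces previous_comments; lag of 2 skips one more row
lag1 <- ave(df$comment_count, df$member_id, FUN = lag_cumsum)
lag2 <- ave(df$comment_count, df$member_id,
            FUN = function(x) lag_cumsum(x, k = 2))
```

With the sample data, lag1 is 0, 4, 5, 0, 12, 0 and lag2 is 0, 0, 4, 0, 0, 0. Indexing with seq_len() rather than head(x, -k) is a deliberate choice here, so that k = 0 returns the plain cumulative sum.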
+0

Works perfectly. Thank you, and Merry Christmas :)! – Nikolas

8

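You can use lag from dplyr and change n to use a different lag:

library(dplyr)
df %>%
    group_by(member_id) %>%
    # lag() takes the offset as n; default fills the first row of each group
    mutate(previous_comments = lag(cumsum(comment_count), n = 1, default = 0))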
# member_id entry_id comment_count   timestamp previous_comments 
#1   1  a    4 2008-06-09 12:41:00     0 
#2   1  b    1 2008-07-14 18:41:00     4 
#3   1  c    3 2008-07-17 15:40:00     5 
#4   2  d   12 2008-06-09 12:41:00     0 
#5   2  e   50 2008-09-18 10:22:00    12 
#6   3  f    0 2008-10-03 13:36:00     0 

Or use data.table:

library(data.table)
setDT(df)[, previous_comments := c(0, cumsum(comment_count[-.N])), by = member_id]
+0

Thanks @akrun for teaching me about dplyr's lag! \o/ – Maiasaura

4

Just subtract comment_count from the result of ave:

transform(df, 
    aggregated_count = ave(comment_count, member_id, FUN = cumsum) - comment_count)
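
The subtraction works because the cumulative sum at each row includes that row's own value, so removing it leaves the sum of the earlier rows in the group. A quick check on the question's data, as a sketch (the names by_shift and by_subtract are illustrative only); note that this shortcut covers a lag of exactly 1 and does not extend to larger lags.

```r
# Sample data from the question (only the columns needed here)
df <- data.frame(
  member_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  comment_count = c(4L, 1L, 3L, 12L, 50L, 0L)
)

# Approach 1: shift the values down by one row, then accumulate
by_shift <- ave(df$comment_count, df$member_id,
                FUN = function(x) cumsum(c(0, head(x, -1))))

# Approach 2: accumulate, then remove the current row's contribution
by_subtract <- ave(df$comment_count, df$member_id, FUN = cumsum) -
  df$comment_count

# Both give the same previous_comments column
same <- all(by_shift == by_subtract)
```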