Cumulative sum with lag

I have a very large dataset that, simplified, looks like this:

row. member_id entry_id comment_count timestamp 
1  1   a    4   2008-06-09 12:41:00 
2  1   b    1   2008-07-14 18:41:00 
3  1   c    3   2008-07-17 15:40:00 
4  2   d    12   2008-06-09 12:41:00 
5  2   e    50   2008-09-18 10:22:00 
6  3   f    0   2008-10-03 13:36:00

I can compute the cumulative count per member with the following code:

transform(df, aggregated_count = ave(comment_count, member_id, FUN = cumsum))

But I want a lag of 1 in the cumulative data, that is, I want cumsum to ignore the current row. The result should be:

row. member_id entry_id  comment_count timestamp    previous_comments 
1  1   a    4   2008-06-09 12:41:00  0 
2  1   b    1   2008-07-14 18:41:00  4 
3  1   c    3   2008-07-17 15:40:00  5 
4  2   d    12   2008-06-09 12:41:00  0 
5  2   e    50   2008-09-18 10:22:00  12 
6  3   f    0   2008-10-03 13:36:00  0

Any ideas on how to do this in R? Maybe even with a lag larger than 1?

Data for reproducibility:

# dput(df) 
structure(list(member_id = c(1L, 1L, 1L, 2L, 2L, 3L), entry_id = c("a", 
"b", "c", "d", "e", "f"), comment_count = c(4L, 1L, 3L, 12L, 
50L, 0L), timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00", 
"2008-07-17 15:40:00", "2008-06-09 12:41:00", "2008-09-18 10:22:00", 
"2008-10-03 13:36:00")), .Names = c("member_id", "entry_id", 
"comment_count", "timestamp"), row.names = c("1", "2", "3", "4", 
"5", "6"), class = "data.frame")

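Source

2014-12-25 Nikolas

It looks like you already wrote the correct code out in a sentence, hint hint :) –

You can use 0 for the first element and drop the last element using head(x, -1):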

transform(df, previous_comments=ave(comment_count, member_id, 
      FUN = function(x) cumsum(c(0, head(x, -1))))) 
# member_id entry_id comment_count   timestamp previous_comments 
#1   1  a    4 2008-06-09 12:41:00     0 
#2   1  b    1 2008-07-14 18:41:00     4 
#3   1  c    3 2008-07-17 15:40:00     5 
#4   2  d   12 2008-06-09 12:41:00     0 
#5   2  e   50 2008-09-18 10:22:00    12 
#6   3  f    0 2008-10-03 13:36:00     0
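The shifting idea above (prepend a zero, drop the trailing element) can probably be extended to a lag larger than 1, which the question also asks about. The following is only a sketch: lagged_cumsum is a hypothetical helper name that does not appear in the original answers, and it assumes the lag n is at least 1. The data frame is a reduced copy of the question's data with only the two columns that matter here.

```r
# Hypothetical helper (not from the original answers): cumulative sum of x,
# shifted down by n positions, with the first n entries filled with 0.
# Assumes n >= 1.
lagged_cumsum <- function(x, n = 1L) {
  stopifnot(n >= 1)
  # n zeros, then the cumulative sum without the last n values of x;
  # the outer head() truncates to length(x) for groups shorter than n.
  head(c(rep(0, n), cumsum(head(x, -n))), length(x))
}

# Reduced copy of the question's data.
df <- data.frame(
  member_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  comment_count = c(4L, 1L, 3L, 12L, 50L, 0L)
)

# Lag of 1 reproduces previous_comments; lag of 2 also skips the row before.
lag1 <- ave(df$comment_count, df$member_id,
            FUN = function(x) lagged_cumsum(x, 1))
lag2 <- ave(df$comment_count, df$member_id,
            FUN = function(x) lagged_cumsum(x, 2))
```

With n = 1 this should give the same column as the transform call above; with n = 2 each row sums only the comments from rows at least two positions earlier within the same member.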

Source

2014-12-25 17:17:08 GSee

Works perfectly. Thank you, and Merry Christmas :)! – Nikolas

You can use lag from dplyr and change n to set the size of the lag:

library(dplyr)
df %>%
    group_by(member_id) %>%
    mutate(previous_comments = lag(cumsum(comment_count), n = 1, default = 0))
# member_id entry_id comment_count   timestamp previous_comments 
#1   1  a    4 2008-06-09 12:41:00     0 
#2   1  b    1 2008-07-14 18:41:00     4 
#3   1  c    3 2008-07-17 15:40:00     5 
#4   2  d   12 2008-06-09 12:41:00     0 
#5   2  e   50 2008-09-18 10:22:00    12 
#6   3  f    0 2008-10-03 13:36:00     0

Or using data.table:

library(data.table)
setDT(df)[, previous_comments := c(0, cumsum(comment_count[-.N])), by = member_id]

Source

2014-12-25 17:10:26 akrun

Thanks @akrun for teaching me about dplyr's lag! \o/ – Maiasaura

Instead of dropping the last element, just subtract comment_count from the result of ave:

transform(df, 
    aggregated_count = ave(comment_count, member_id, FUN = cumsum) - comment_count)
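As a quick sanity check, the subtraction trick can be compared against the shift-based version on the question's data. This is a minimal sketch using a reduced copy of the data frame; the variable names by_subtraction and by_shifting are made up for illustration. Note that subtracting the current row only covers a lag of exactly 1.

```r
# Reduced copy of the question's data (only the columns needed here).
df <- data.frame(
  member_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  comment_count = c(4L, 1L, 3L, 12L, 50L, 0L)
)

# Running total per member, minus the current row's own count.
by_subtraction <- ave(df$comment_count, df$member_id, FUN = cumsum) -
  df$comment_count

# Running total per member over the values shifted down by one position.
by_shifting <- ave(df$comment_count, df$member_id,
                   FUN = function(x) cumsum(c(0, head(x, -1))))

# Both approaches should agree row by row.
identical_result <- all(by_subtraction == by_shifting)
```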

Source

2014-12-25 21:54:10
