2017-09-01 62 views
0

Rolling cumulative sum with lag not restricted to the lag range

I have a large data frame with data of the following structure:

name  date val1 val2 
1  A 2017-01-01 0 2 
2  A 2017-01-02 1 1 
3  A 2017-01-03 1 0 
4  A 2017-01-04 0 3 
5  A 2017-01-05 1 1 
6  A 2017-01-06 0 0 
7  B 2017-01-01 0 0 
8  B 2017-01-02 0 3 
9  B 2017-01-03 1 2 
10 B 2017-01-04 1 1 
11 B 2017-01-05 0 0 
12 B 2017-01-06 1 0 
13 C 2017-01-01 0 2 
14 C 2017-01-02 0 1 
15 C 2017-01-03 1 2 
16 C 2017-01-04 0 0 
17 C 2017-01-05 0 0 
18 C 2017-01-06 1 3 

For every date within each name group, I would now like to compute the cumsum() of val1 over the last 2 occurrences and of val2 over the last 3 occurrences.

I tried this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; it includes the creation of a sample data set):

library(dplyr) 
library(data.table) 

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day') 

d <- CJ(
    name = c('A', 'B', 'C'), 
    date = dates 
) %>% 
    left_join(
    data.frame(
     name = c(rep('A',6), rep('B',6), rep('C',6)), 
     date = c(rep(dates, 3)), 
     val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1), 
     val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3) 
    ) 
) 


d %>% 
    group_by(name) %>% 
    mutate(
    val1_l2 = dplyr::lag(cumsum(val1), k=2), 
    val2_l3 = dplyr::lag(cumsum(val2), k=3) 
) 

This produces:

name  date val1 val2 val1_l2 val2_l3 
    <chr>  <date> <dbl> <dbl> <dbl> <dbl> 
1  A 2017-01-01  0  2  NA  NA 
2  A 2017-01-02  1  1  0  2 
3  A 2017-01-03  1  0  1  3 
4  A 2017-01-04  0  3  2  3 
5  A 2017-01-05  1  1  2  6 
6  A 2017-01-06  0  0  3  7 
7  B 2017-01-01  0  0  NA  NA 
8  B 2017-01-02  0  3  0  0 
9  B 2017-01-03  1  2  0  3 
10  B 2017-01-04  1  1  1  5 
11  B 2017-01-05  0  0  2  6 
12  B 2017-01-06  1  0  2  6 
13  C 2017-01-01  0  2  NA  NA 
14  C 2017-01-02  0  1  0  2 
15  C 2017-01-03  1  2  0  3 
16  C 2017-01-04  0  0  1  5 
17  C 2017-01-05  0  0  1  5 
18  C 2017-01-06  1  3  1  5 

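However, it seems that cumsum() is always computed over all previous records within the name group, rather than over the rolling range of k=2 for val1 and k=3 for val2.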

Example:

Row Variable Calculated Expected 
    5 val1_l2  2   1 
    5 val2_l3  6   4 

What am I doing wrong?
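
The expected values for row 5 can be checked by hand. The sketch below uses base R only and copies the group A values from the sample data; the names val1_A and val2_A are introduced here for illustration. It also shows where the calculated values 2 and 6 come from: they are the running totals up to row 4, which is consistent with an offset of one row. That is the default of dplyr::lag(), whose offset argument is named n rather than k.

```r
# val1 and val2 for group A, copied from the sample data in the question
val1_A <- c(0, 1, 1, 0, 1, 0)
val2_A <- c(2, 1, 0, 3, 1, 0)

# Expected at row 5: sum over the last 2 values of val1 (rows 4 and 5)
sum(val1_A[4:5])   # 1
# Expected at row 5: sum over the last 3 values of val2 (rows 3, 4 and 5)
sum(val2_A[3:5])   # 4

# Calculated at row 5: the running total shifted down by one row,
# i.e. the cumulative sum up to and including row 4
cumsum(val1_A)[4]  # 2
cumsum(val2_A)[4]  # 6
```

Shifting a cumulative sum moves the running total to a later row; it does not shorten the range that was summed.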

+1

This is not clear to me. – Sotos

+0

According to your logic, shouldn't 'val2_l3' in row 5 be 4 (3 + 0 + 1), rather than 4? – count

+0

Indeed it should, apologies and thanks for pointing it out. I have corrected it in the question. –

Answer

0

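We may not need lag here. We can replace all values except those in the last two or three rows of each group with 0, and then use cumsum. Here is an example. Note that d2 is the final output. n():(n() - 1) and n():(n() - 2) refer to the last two and the last three rows, respectively. ifelse(row_number() %in% ...) checks whether the row number is one of those last two or three rows.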

d2 <- d %>% 
    group_by(name) %>% 
    mutate(val1_l2 = ifelse(row_number() %in% n():(n() - 1), val1, 0), 
     val2_l3 = ifelse(row_number() %in% n():(n() - 2), val2, 0)) %>% 
    mutate(val1_l2 = cumsum(val1_l2), 
     val2_l3 = cumsum(val2_l3)) 

d2 
# A tibble: 18 x 6 
# Groups: name [3] 
    name  date val1 val2 val1_l2 val2_l3 
    <chr>  <date> <dbl> <dbl> <dbl> <dbl> 
1  A 2017-01-01  0  2  0  0 
2  A 2017-01-02  1  1  0  0 
3  A 2017-01-03  1  0  0  0 
4  A 2017-01-04  0  3  0  3 
5  A 2017-01-05  1  1  1  4 
6  A 2017-01-06  0  0  1  4 
7  B 2017-01-01  0  0  0  0 
8  B 2017-01-02  0  3  0  0 
9  B 2017-01-03  1  2  0  0 
10  B 2017-01-04  1  1  0  1 
11  B 2017-01-05  0  0  0  1 
12  B 2017-01-06  1  0  1  1 
13  C 2017-01-01  0  2  0  0 
14  C 2017-01-02  0  1  0  0 
15  C 2017-01-03  1  2  0  0 
16  C 2017-01-04  0  0  0  0 
17  C 2017-01-05  0  0  0  0 
18  C 2017-01-06  1  3  1  3 
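
To see what the index expressions in the code above evaluate to, here is a small base R sketch for a single group of six rows. The names n, rows, val2_A and masked are introduced for illustration only; rows stands in for what row_number() returns inside a group.

```r
n <- 6               # number of rows in one group of the sample data
rows <- seq_len(n)   # stands in for row_number() within that group

n:(n - 1)            # 6 5    -> the last two rows
n:(n - 2)            # 6 5 4  -> the last three rows

# val2 for group A, copied from the sample data
val2_A <- c(2, 1, 0, 3, 1, 0)

# Keep val2 only in the last three rows, set every other row to 0
masked <- ifelse(rows %in% n:(n - 2), val2_A, 0)   # 0 0 0 3 1 0

# Running total of the masked values, matching val2_l3 for group A above
cumsum(masked)                                      # 0 0 0 3 4 4
```

Because the mask is anchored to the end of each group, the sum covers a window that ends at the current row only for rows within the last two or three of the group.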

Data

library(dplyr) 
library(data.table) 

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day') 

d <- CJ(
    name = c('A', 'B', 'C'), 
    date = dates 
) %>% 
    left_join(
    data.frame(
     name = c(rep('A',6), rep('B',6), rep('C',6)), 
     date = c(rep(dates, 3)), 
     val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1), 
     val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3) 
    ) 
)
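
If a window that moves with every row is wanted, as the Expected column in the question suggests, one possible sketch is to subtract a lagged cumulative sum from the cumulative sum itself. This is a different technique from the masking approach in this answer. It assumes that "last k occurrences" means the current row plus the k - 1 rows before it, and that rows are ordered by date within each name. The helper roll_sum and the result name d_roll are hypothetical, and the data frame is rebuilt here so that the block runs on its own.

```r
library(dplyr)

dates <- seq(as.Date("2017-01-01"), as.Date("2017-01-06"), by = "1 day")

# Same values as the sample data, built directly as a data frame
d <- data.frame(
  name = rep(c("A", "B", "C"), each = 6),
  date = rep(dates, 3),
  val1 = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1),
  val2 = c(2, 1, 0, 3, 1, 0, 0, 3, 2, 1, 0, 0, 2, 1, 2, 0, 0, 3)
)

# Hypothetical helper: sum of the current value and the k - 1 values before it.
# The first k - 1 rows of a group sum whatever values are available so far.
roll_sum <- function(x, k) {
  cs <- cumsum(x)
  cs - dplyr::lag(cs, n = k, default = 0)
}

d_roll <- d %>%
  arrange(name, date) %>%
  group_by(name) %>%
  mutate(
    val1_l2 = roll_sum(val1, 2),
    val2_l3 = roll_sum(val2, 3)
  ) %>%
  ungroup()

d_roll$val1_l2[5]  # 1, the expected value for row 5
d_roll$val2_l3[5]  # 4, the expected value for row 5
```

Packages such as zoo and data.table also ship rolling-window functions, which may be a better fit for large data frames.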