How do I do a rolling cumsum over consecutive rows in R?

I have a toy example. How do I sum every two consecutive rows of y, grouped by x?

library(tibble) 
l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0)) 

df <- as_tibble(l) 
df 
#> # A tibble: 6 x 2 
#>  x  y 
#> <chr> <dbl> 
#> 1  a  1 
#> 2  b  4 
#> 3  a  3 
#> 4  b  3 
#> 5  a  7 
#> 6  b  0

So the output would look like this:

group sum seq 
    a  4  1 
    a  10  2 
    b  7  1 
    b  3  2

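I would like the most efficient method, preferably using the tidyverse and possibly roll_sum() from the RcppRoll package, with code that lets the number of consecutive rows vary, so that it can be applied to real-world data with many groups.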

TIA

Source

2017-09-27 pssguy

One approach is to use group_by %>% do, where you can customize the data frame returned by do:

library(RcppRoll); library(tidyverse) 

n = 2 
df %>% 
    group_by(x) %>% 
    do(
     data.frame(
      sum = roll_sum(.$y, n), 
      seq = seq_len(length(.$y) - n + 1) 
     ) 
    ) 

# A tibble: 4 x 3 
# Groups: x [2] 
#  x sum seq 
# <chr> <dbl> <int> 
#1  a  4  1 
#2  a 10  2 
#3  b  7  1 
#4  b  3  2
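
To make explicit what roll_sum(.$y, n) computes within each group, here is a minimal base R sketch that needs no packages. It gets the rolling sum as differences of a cumulative sum. The helper name roll_sum_base is hypothetical and not part of RcppRoll; it is a cross-check under those assumptions, not a replacement for the answer above.

```r
## Hypothetical helper (not from RcppRoll): rolling sum of width n,
## computed as differences of a cumulative sum.
roll_sum_base <- function(y, n) {
  if (length(y) < n) return(numeric(0))  # window does not fit in this group
  cs <- cumsum(c(0, y))
  cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)]
}

## Toy data from the question
df <- data.frame(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0))
n <- 2

## Split y by group, roll within each group, then bind the pieces together
groups <- split(df$y, df$x)
result <- do.call(rbind, lapply(names(groups), function(g) {
  s <- roll_sum_base(groups[[g]], n)
  data.frame(group = rep(g, length(s)), sum = s, seq = seq_along(s))
}))
result
```

Changing n changes the window width. Groups shorter than n simply contribute no rows.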

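Edit: Since this is not very efficient, probably because of the overhead of constructing data frames and binding them on the fly, here is an improved version (still somewhat slower than data.table, but not by much now):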

df %>% 
    group_by(x) %>% 
    summarise(sum = list(roll_sum(y, n)), seq = list(seq_len(n() -n + 1))) %>% 
    unnest()

Timing, using @Matt's data and setup:

library(tibble) 
library(dplyr) 
library(tidyr) ## Needed for unnest() 
library(RcppRoll) 
library(stringi) ## Only included for ability to generate random strings 

## Generate data with arbitrary number of groups and rows -------------- 
rowCount <- 100000 
groupCount <- 10000 
sumRows <- 2L 
set.seed(1) 

l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE), 
      y = sample(0:10,rowCount,rep=TRUE)) 

## Using dplyr and tibble ----------------------------------------------- 

ptm <- proc.time() ## Start the clock 

dplyr_result <- l %>% 
    group_by(x) %>% 
    summarise(sum = list(roll_sum(y, sumRows)), seq = list(seq_len(n() - sumRows + 1))) %>% 
    unnest() 


dplyr_time <- proc.time() - ptm ## Stop the clock 

## Using data.table instead ---------------------------------------------- 

library(data.table) 

ptm <- proc.time() ## Start the clock 

setDT(l) ## Convert l to a data.table 
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"), 
        seq = seq_len(.N)), 
       keyby = .(x)][!is.na(sum)] 

data.table_time <- proc.time() - ptm

The results are:

dplyr_time 
# user system elapsed 
# 0.688 0.003 0.689 
data.table_time 
# user system elapsed 
# 0.422 0.009 0.430

Source

2017-09-27 01:53:16 Psidom

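Here is one approach. Since you want to sum two consecutive rows, you can use lead() to compute sum. For seq, judging by your expected result, I think you can simply take the row number within each group. After these operations, you can arrange your data by x (or by x and seq, if necessary). Finally, you remove the rows containing NAs. If necessary, you can add select(-y) at the end of the code to drop y.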

group_by(df, x) %>% 
mutate(sum = y + lead(y), 
     seq = row_number()) %>% 
arrange(x) %>% 
ungroup %>% 
filter(complete.cases(.)) 

#  x  y sum seq 
# <chr> <dbl> <dbl> <int> 
#1  a  1  4  1 
#2  a  3 10  2 
#3  b  4  7  1 
#4  b  3  3  2
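
The lead() idea can be stretched to a window of n rows by adding up n shifted copies of y, rather than writing y + lead(y) + lead(y, 2) + ... by hand. Below is a minimal sketch in base R, so it runs without dplyr. The helpers lead_base and roll_by_lead are hypothetical names introduced only for this illustration.

```r
## Hypothetical base R stand-in for dplyr::lead(): shift y up by k, pad with NA
lead_base <- function(y, k) {
  len <- length(y)
  if (k >= len) return(rep(NA_real_, len))
  c(y[(k + 1):len], rep(NA_real_, k))
}

## Sum of y and its first n - 1 leads; the last n - 1 positions become NA
roll_by_lead <- function(y, n) {
  Reduce(`+`, lapply(0:(n - 1), function(k) lead_base(y, k)))
}

## Toy data from the question
df <- data.frame(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0))
n <- 2

## ave() applies the function within each group and keeps the row order
df$sum <- ave(df$y, df$x, FUN = function(y) roll_by_lead(y, n))
df$seq <- ave(df$y, df$x, FUN = seq_along)

out <- df[!is.na(df$sum), ]  # drop the incomplete windows
out <- out[order(out$x), ]   # arrange by group
out
```

With n = 2 this gives the same sum and seq columns as the output above. For larger windows a dedicated rolling function is likely to be faster, since this sketch builds n full-length vectors per group.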

Source

2017-09-27 01:51:37 jazzurro

A solution using tidyverse and zoo. This is similar to Psidom's approach.

library(tidyverse) 
library(zoo) 

df2 <- df %>% 
    group_by(x) %>% 
    do(tibble(x = unique(.$x), 
       sum = rollapplyr(.$y, width = 2, FUN = sum))) %>% 
    mutate(seq = 1:n()) %>% 
    ungroup() 
df2 
# A tibble: 4 x 3 
     x sum seq 
    <chr> <dbl> <int> 
1  a  4  1 
2  a 10  2 
3  b  7  1 
4  b  3  2

Source

2017-09-27 02:00:29 www

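There is a typo :) 'rollapply' – Wen

@Wen Thanks. 'rollapplyr' works too. Its default alignment is set to "right". That is why it is called 'rollapplyr'. – www

Upvoted for my silly question and for learning something new :) – Wen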

zoo + dplyr

library(zoo) 
library(dplyr) 
library(tidyr) ## Needed for drop_na() 

df %>% 
    group_by(x) %>% 
    mutate(sum = c(NA, rollapply(y, width = 2, sum)), 
      seq = row_number() - 1) %>% 
    drop_na() 

# A tibble: 4 x 4 
# Groups: x [2] 
     x  y sum seq 
    <chr> <dbl> <dbl> <dbl> 
1  a  3  4  1 
2  b  3  7  1 
3  a  7 10  2 
4  b  0  3  2

If the moving window is only 2, use lag:

df %>% 
    group_by(x) %>% 
    mutate(sum = y + lag(y), 
    seq = row_number() - 1) %>% 
    drop_na() 
# A tibble: 4 x 4 
# Groups: x [2] 
     x  y sum seq 
    <chr> <dbl> <dbl> <dbl> 
1  a  3  4  1 
2  b  3  7  1 
3  a  7 10  2 
4  b  0  3  2

Edit:

n = 3 # your moving window 
df %>% 
    group_by(x) %>% 
    mutate(sum = c(rep(NA, n - 1), rollapply(y, width = n, sum)), 
      seq = row_number() - n + 1) %>% 
    drop_na()

Source

2017-09-27 02:33:30 Wen

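Yes, I have used the lag approach before, but once you go past 3 it gets ugly – pssguy

@pssguy Yes, you are right. I highlighted that when your moving window is 2, you can use 'lag' or 'shift' – Wen

When applied to real data with a sequence length of 17, I get the error: Column 'sum' must be length 32 (the group size) or one, not 17. This does not happen with the other solutions. Any idea why? It seems fast otherwise – pssguy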

I noticed you asked for the most efficient way. If you are considering scaling to larger datasets, I would strongly recommend data.table.

library(data.table) 
library(RcppRoll) 

setDT(l) ## Convert l (the list from the question) to a data.table 

l[, .(sum = RcppRoll::roll_sum(y, n = 2L, fill = NA, align = "left"), 
     seq = seq_len(.N)), 
    keyby = .(x)][!is.na(sum)]

A rough benchmark of this against the answer that uses tidyverse packages, with 100,000 rows and 10,000 groups, shows a significant difference.

(I used Psidom's answer rather than jazzurro's, because jazzurro's does not generalize to an arbitrary number of rows.)

library(tibble) 
library(dplyr) 
library(RcppRoll) 
library(stringi) ## Only included for ability to generate random strings 

## Generate data with arbitrary number of groups and rows -------------- 
rowCount <- 100000 
groupCount <- 10000 
sumRows <- 2L 
set.seed(1) 

l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE), 
      y = sample(0:10,rowCount,rep=TRUE)) 

## Using dplyr and tibble ----------------------------------------------- 

ptm <- proc.time() ## Start the clock 

dplyr_result <- l %>% 
    group_by(x) %>% 
    do(
     data.frame(
      sum = roll_sum(.$y, sumRows), 
      seq = seq_len(length(.$y) - sumRows + 1) 
     ) 
    ) 

dplyr_time <- proc.time() - ptm ## Stop the clock 

## Using data.table instead ---------------------------------------------- 

library(data.table) 

ptm <- proc.time() ## Start the clock 

setDT(l) ## Convert l to a data.table 
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"), 
        seq = seq_len(.N)), 
       keyby = .(x)][!is.na(sum)] 

data.table_time <- proc.time() - ptm ## Stop the clock

Results:

> dplyr_time 
    user system elapsed 
    10.28 0.04 10.36 
> data.table_time 
    user system elapsed 
    0.35 0.02 0.36 

> all.equal(dplyr_result,as.tibble(dt_result)) 
[1] TRUE
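
As a variant of the data.table call above, the rolling sum can also be computed with data.table's own frollsum(), which removes the RcppRoll dependency. This is a sketch that assumes data.table 1.12.0 or later (where the froll* functions are available) and uses the toy data from the question. It was not part of the benchmark, so no timing is claimed for it.

```r
library(data.table)

## Toy data from the question
dt <- data.table(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0))
sumRows <- 2L

## align = "left" puts each window's sum on its first row; the trailing
## incomplete windows are filled with NA (the default fill) and then dropped
dt_result <- dt[, .(sum = frollsum(y, n = sumRows, align = "left"),
                    seq = seq_len(.N)),
                keyby = .(x)][!is.na(sum)]
dt_result
```

Note that the final filter would also drop windows whose sum is NA because y itself contains NA, which may or may not be what you want on real data.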

Source

2017-09-27 12:50:03

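Yes, this does look like the best approach. I tend to work with smaller datasets and less intensive processing, but the dataset I am using this on has 250,000 rows and 2,500 groups, so it is comparable to your example. I got an even bigger timing gap. I have accepted @Psidom's answer because I specifically mentioned the tidyverse, but I will use yours in production – pssguy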

A small variant on the existing answers: first convert the data to list-column format, then use purrr to map() roll_sum() over the data.

library(tidyverse) 
library(RcppRoll) 

l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0)) 
as_tibble(l) %>% 
    group_by(x) %>% 
    summarize(list_y = list(y)) %>%       # one list element of y values per group 
    mutate(rollsum = map(list_y, ~roll_sum(.x, 2))) %>% 
    select(x, rollsum) %>% 
    unnest(rollsum) %>% 
    group_by(x) %>% 
    mutate(seq = row_number())

I think that if you have the latest version of purrr, you can get rid of the last two lines (the final group_by() and mutate()) by using imap() instead of map().

Source

2017-11-07 07:07:55
