2013-08-26 139 views
4

是否有一個使用data.table分組計算滾動統計信息的現有習慣用法?R data.table與rollapply

例如,假設下面的代碼:

DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6) 
setkey(DT, y) 

stat.ror <- DT[,rollapply(v, width=1, by=1, mean, na.rm=TRUE), by=y]; 

如果沒有一個呢,會是什麼做的最好方法是什麼?

+2

有什麼問題嗎? –

+0

我想在data.table中有'roll'的一些功能,但也許你打算用動物園標記這個? – Frank

+1

這裏沒有SO問題,這是開放的辯論(不是什麼樣的)。但是'data.table'已經實現了一個非常快速的'roll'參數,可以幫助做roll-join,window-join等幾個國王... – statquant

回答

3

其實我現在正在努力解決這個問題。這裏是一個 部分解決方案,可通過單個列分組工作:

編輯:與RcppRoll得到了它,我想:

windowed.average <- function(input.table, 
          window.width = 2, 
          id.cols = names(input.table)[3], 
          index.col = names(input.table)[1], 
          val.col = names(input.table)[2]) { 
    require(RcppRoll) 

    avg.with.group <- 
    input.table[,roll_mean(get(val.col), n = window.width),by=c(id.cols)] 
    avg.index <- 
    input.table[,roll_mean(get(index.col), n = window.width),by=c(id.cols)]$V1 

    output.table <- data.table(
    Group = avg.with.group, 
    Index = avg.index) 

    # rename columns to (sensibly) match inputs 
    setnames(output.table, old=colnames(output.table), 
      new = c(id.cols,val.col,index.col)) 

    return(output.table) 
} 

A(寫的不好),單元測試會通過上面的:

require(testthat) 
require(zoo) 
test.datatable <- data.table(Time = rep(seq_len(10), times=2), 
          Voltage = runif(20), 
          Channel= rep(seq_len(2),each=10)) 
test.width <- 8 

# first test: single id column 
test.avgtable <- data.table(
    test.datatable[,rollapply(Voltage, width = test.width, mean, na.rm=TRUE), 
         by=c("Channel")], 
    Time = test.datatable[,rollapply(Time, width = test.width, mean, na.rm=TRUE), 
         by=c("Channel")]$V1) 
setnames(test.avgtable,old=names(test.avgtable), 
     new=c("Channel","Voltage","Time")) 

expect_that(test.avgtable, 
      is_identical_to(windowed.average(test.datatable,test.width))) 

如何看起來:

> test.datatable 
    Time  Voltage Channel Class 
1: 1 0.310935570  1  1 
2: 2 0.565257533  1  2 
3: 3 0.577278573  1  1 
4: 4 0.152315111  1  2 
5: 5 0.836052122  1  1 
6: 6 0.655417230  1  2 
7: 7 0.034859642  1  1 
8: 8 0.572040136  1  2 
9: 9 0.268105436  1  1 
10: 10 0.126484340  1  2 
11: 1 0.139711248  2  1 
12: 2 0.336316520  2  2 
13: 3 0.413086486  2  1 
14: 4 0.304146029  2  2 
15: 5 0.399344631  2  1 
16: 6 0.581641210  2  2 
17: 7 0.183586025  2  1 
18: 8 0.009775488  2  2 
19: 9 0.449576242  2  1 
20: 10 0.938517952  2  2 

> test.avgtable 
    Channel Voltage Time 
1:  1 0.4630195 4.5 
2:  1 0.4576657 5.5 
3:  1 0.4028191 6.5 
4:  2 0.2959510 4.5 
5:  2 0.3346841 5.5 
6:  2 0.4099593 6.5 

不幸的是,我沒有設法使其與多個分組工作(因爲這第二部分顯示):

看起來不錯了多列組:

# second test: multiple id columns 
# Depends on the first test passing to be meaningful. 
test.width <- 4 
test.datatable[,Class:= rep(seq_len(2),times=ceiling(nrow(test.datatable)/2))] 
# windowed.average(test.datatable,test.width,id.cols=c("Channel","Class")) 
test.avgtable <- rbind(windowed.average(test.datatable[Class==1,],test.width), 
         windowed.average(test.datatable[Class==2,],test.width)) 
# somewhat artificially attaching expected class labels 
test.avgtable[,Class:= rep(seq_len(2),times=nrow(test.avgtable)/4,each=2)] 
setkey(test.avgtable,Channel) 
setcolorder(test.avgtable,c("Channel","Class","Voltage","Time")) 

expect_that(test.avgtable, 
      is_equivalent_to(windowed.average(test.datatable,test.width, 
              id.cols=c("Channel","Class"))))