2013-07-11 20 views
4

在我的一個應用程序中,有一段代碼根據另一個對象中的值檢索data.table對象的信息。將一個範圍的端點與一個序列合併

# say this table contains customers details 
dt <- data.table(id=LETTERS[1:4], 
       start=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"), 
       end=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6,8,10,5), 
       key="id") 

# this one has some historical details 
dt1 <- data.table(id=rep(LETTERS[1:4], each=120), 
        date=seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"), 
        var=rnorm(120), 
        key="id,date") 

# and here I finally retrieve my historical information based one customer detail 
# 
library(data.table) 

myfunc <- function(x) { 
    # some code 
    period <- seq(x$start, x$end, "day") 
    dt1[.(x$id, period)][, mean(var)] 
    # some code 
} 

以獲取所有的結果我用adply

library(plyr) 
library(microbenchmark) 
> adply(dt, 1, myfunc) 
    id  start  end   V1 
1: A 2010-01-01 2010-01-07 0.3143536 
2: B 2010-02-01 2010-02-09 -0.5796084 
3: C 2010-03-01 2010-03-11 0.1171404 
4: D 2010-04-01 2010-04-06 0.2384237 

> microbenchmark(adply(dt, 1, myfunc)) 
Unit: milliseconds 
       expr  min  lq median  uq  max neval 
adply(dt, 1, myfunc) 8.812486 8.998338 9.105776 9.223637 88.14057 100 

你知道的方式,以避免adply呼叫並執行上述一個data.table聲明?或者無論如何一個更快的方法? (標題編輯建議更受歡迎,我想不出一個更好的,謝謝)

回答

5

這是一個偉大的地方使用roll說法data.table

setkey(dt1, id, date) 
setkey(dt, id, start) 

dt[dt1, roll = TRUE][end >= start, 
    list(start = start[1], end = end[1], result = mean(var)), by = id] 

# benchmark 
microbenchmark(OP = adply(dt, 1, myfunc), 
       Frank = dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]], 
       eddi = dt[dt1, roll = TRUE][end >= start,list(start = start[1], end = end[1], result = mean(var)), by = id]) 
#Unit: milliseconds 
# expr  min  lq median  uq  max neval 
# OP 24.436126 29.184786 30.853094 32.493521 50.898664 100 
# Frank 9.115676 11.303691 12.081000 13.122753 28.370415 100 
# eddi 5.336315 6.323643 6.771898 7.497285 9.531376 100 

的時間差將變得更加劇作家隨着數據集大小的增長。

2

我可以給你一堆嵌套[.data.table電話:

set.seed(1) 
require(data.table) 
# generate dt, dt1 as above 
dt[ 
    dt1[ 
     as.list(dt[,seq.Date(start,end,"day"),by="id"]) 
    ][,mean(var),by=id] 
] 

# id  start  end   V1 
# 1: A 2010-01-01 2010-01-07 0.04475859 
# 2: B 2010-02-01 2010-02-09 -0.01681972 
# 3: C 2010-03-01 2010-03-11 0.39791318 
# 4: D 2010-04-01 2010-04-06 0.77854732 

我使用as.list取消設置鍵。我不知道是否有比這更好的辦法...

require(microbenchmark) 
require(plyr) 
microbenchmark(
    adply=adply(dt, 1, myfunc), 
    dtdtdt= dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]] 
) 

# Unit: milliseconds 
# expr  min  lq median  uq  max neval 
# adply 12.987334 13.247374 13.477386 14.371258 18.362505 100 
# dtdtdt 4.854708 4.944596 4.993678 5.233507 7.082461 100 

編輯:(EDDI)替代以上,將需要少一個的合併(如在評論中討論):

setkey(dt, NULL) 

dt1[dt[, list(seq.Date(start,end,"day"), end), by=id]][, 
    list(start = date[1], end = end[1], result = mean(var)), by = id] 
# or 
dt1[dt[, seq.Date(start,end,"day"), by=id]][, 
    list(start = date[1], end = date[.N], result = mean(var)), by = id] 
+1

如果除了序列之外還返回第一個'[]'中的'end',那麼您可以更清楚一點,那麼您不需要進行最後的合併。另一種選擇是從'by'計算'end'。無論哪種情況,您都可以在開始之前通過取消設置'dt'的密鑰來清除表達式中的'as.list'或'setkey'。 – eddi

+0

@eddi和序列一起返回'end'應該與你的第一個'[]'相同'nomatch = 0'innit?無論如何,謝謝你們,你們**非常有幫助! – Michele

+0

@Michele,我不確定你說的是哪一步,但是在某些時候他們確實收斂:) – eddi

相關問題