2012-05-23 28 views
2

How can I speed up this _for_ loop? With data.table + lapply?

This code generates a dataset similar to my own:


# Ten consecutive days, stacked twice (once per id)
df <- c(seq(as.Date("2012-01-01"), as.Date("2012-01-10"), "days"))
df <- as.data.frame(df)
df <- rbind(df, df)

id <- c(rep.int(1, 10), rep.int(2, 10))
id <- as.data.frame(id)

cnt <- c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
cnt <- as.data.frame(cnt)

df <- cbind(id, df, cnt)
names(df) <- c("id", "date", "cnt")

# Make the last date of each id non-consecutive
df$date[df$date == "2012-01-10"] <- "2012-01-20"

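I'm trying to find, for each id, the sum of the variable `cnt` over the past 7 days. Sometimes the dates are non-consecutive (see the last date in `df` above).

Here's the loop: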


system.time(
  for (i in 1:length(df$date)) {
    df$cnt.weekly[i] <-
      sum(df$cnt[which((df$date == df$date[i] - 1) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 2) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 3) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 4) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 5) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 6) & df$id == df$id[i])])
  }
)

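I'm ultimately running this on an 8-million-row data.frame (thousands of ids), so while the toy example is fast, the loop is very slow in practice.

I've had very good luck with the data.table package in other parts of my code, but I can't figure out how to make it work here. Could this be done with data.table?

Thanks in advance!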

+0

Try `rollapply`? Also, store your `df$id == df$id[i]` comparison so that it isn't recomputed every time. Also, exploit the fact that if `i-6` is within the week, then so are `i-5`, `i-4`, etc. See also: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941
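The suggestions in this comment can be sketched in base R. This is only an illustration, not the commenter's code: it does not use `rollapply`, and the helper name `prev6_sum` is made up for the example. It handles each id once (so the id comparison is not repeated per row) and replaces the six equality tests with a single date-range test covering days `d-6` to `d-1`.

```r
# Rebuild the toy data from the question in one step.
df <- data.frame(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), times = 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)

# Hypothetical helper: for each date, sum cnt over the six preceding days.
prev6_sum <- function(date, cnt) {
  vapply(seq_along(date), function(i) {
    in_window <- date >= date[i] - 6 & date <= date[i] - 1
    sum(cnt[in_window])
  }, numeric(1))
}

# Work on one id at a time, so rows of other ids are never scanned.
df$cnt.weekly <- NA_real_
for (rows in split(seq_len(nrow(df)), df$id)) {
  df$cnt.weekly[rows] <- prev6_sum(df$date[rows], df$cnt[rows])
}
```

This still does work proportional to the square of the number of rows within each id, so it is mainly useful as a readable baseline to check faster versions against.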

+0

Thank you, great suggestions. – Statwonk

Answers

5

How about:

> library(data.table)
> DT = as.data.table(df) 
> DT 
     id  date cnt 
[1,] 1 2012-01-01 1 
[2,] 1 2012-01-02 2 
[3,] 1 2012-01-03 3 
[4,] 1 2012-01-04 0 
[5,] 1 2012-01-05 0 
[6,] 1 2012-01-06 4 
[7,] 1 2012-01-07 5 
[8,] 1 2012-01-08 6 
[9,] 1 2012-01-09 7 
[10,] 1 2012-01-20 8 
[11,] 2 2012-01-01 0 
[12,] 2 2012-01-02 1 
[13,] 2 2012-01-03 0 
[14,] 2 2012-01-04 1 
[15,] 2 2012-01-05 2 
[16,] 2 2012-01-06 3 
[17,] 2 2012-01-07 4 
[18,] 2 2012-01-08 5 
[19,] 2 2012-01-09 6 
[20,] 2 2012-01-20 7 

First, accumulate `cnt` within each group. This step is currently ugly, but `:=` by group (coming soon in 1.8.1) will tidy it up.

> DT[,cumcnt:=DT[,cumsum(cnt),by=id][[2]]] 
     id  date cnt cumcnt 
[1,] 1 2012-01-01 1  1 
[2,] 1 2012-01-02 2  3 
[3,] 1 2012-01-03 3  6 
[4,] 1 2012-01-04 0  6 
[5,] 1 2012-01-05 0  6 
[6,] 1 2012-01-06 4  10 
[7,] 1 2012-01-07 5  15 
[8,] 1 2012-01-08 6  21 
[9,] 1 2012-01-09 7  28 
[10,] 1 2012-01-20 8  36 
[11,] 2 2012-01-01 0  0 
[12,] 2 2012-01-02 1  1 
[13,] 2 2012-01-03 0  1 
[14,] 2 2012-01-04 1  2 
[15,] 2 2012-01-05 2  4 
[16,] 2 2012-01-06 3  7 
[17,] 2 2012-01-07 4  11 
[18,] 2 2012-01-08 5  16 
[19,] 2 2012-01-09 6  22 
[20,] 2 2012-01-20 7  29 
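For readers on a data.table release that already supports `:=` by group, the same cumulative column can be written in a single call. A minimal sketch, assuming such a version is installed; the data are rebuilt here so the snippet runs on its own:

```r
library(data.table)

# Same toy data as in the question.
DT <- data.table(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), times = 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)

# Running total of cnt, restarted for each id, added by reference.
DT[, cumcnt := cumsum(cnt), by = id]
```

The rows must already be in date order within each id for the running total to be meaningful, which is the case for this toy data.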

Now join to 7 days before, allowing for irregular dates:

> setkey(DT,id,date) 
> DT[,before7dayago:=DT[SJ(id,date-7),cumcnt,roll=TRUE,mult="last"]] 
     id  date cnt cumcnt before7dayago 
[1,] 1 2012-01-01 1  1   NA 
[2,] 1 2012-01-02 2  3   NA 
[3,] 1 2012-01-03 3  6   NA 
[4,] 1 2012-01-04 0  6   NA 
[5,] 1 2012-01-05 0  6   NA 
[6,] 1 2012-01-06 4  10   NA 
[7,] 1 2012-01-07 5  15   NA 
[8,] 1 2012-01-08 6  21    1 
[9,] 1 2012-01-09 7  28    3 
[10,] 1 2012-01-20 8  36   28 
[11,] 2 2012-01-01 0  0   NA 
[12,] 2 2012-01-02 1  1   NA 
[13,] 2 2012-01-03 0  1   NA 
[14,] 2 2012-01-04 1  2   NA 
[15,] 2 2012-01-05 2  4   NA 
[16,] 2 2012-01-06 3  7   NA 
[17,] 2 2012-01-07 4  11   NA 
[18,] 2 2012-01-08 5  16    0 
[19,] 2 2012-01-09 6  22    1 
[20,] 2 2012-01-20 7  29   22 
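To see what `roll=TRUE` contributes, here is a small standalone sketch of a rolling join. The tables `x` and `probe` are invented for the illustration and hold only three of the rows for id 1. When a probe date has no exact match, the join falls back to the most recent earlier row for the same id; a probe date earlier than every row yields `NA`.

```r
library(data.table)

# Three observations for a single id, keyed so the join uses (id, date).
x <- data.table(
  id     = 1,
  date   = as.Date(c("2012-01-01", "2012-01-09", "2012-01-20")),
  cumcnt = c(1, 28, 36)
)
setkey(x, id, date)

# Dates to look up: before the first row, an exact match, and a gap.
probe <- data.table(
  id   = 1,
  date = as.Date(c("2011-12-25", "2012-01-01", "2012-01-13"))
)

res <- x[probe, cumcnt, roll = TRUE]
print(res)
# [1] NA  1 28
```

The last value is why the row for 2012-01-20 gets 28 in the table above: 2012-01-13 is missing, so the running total from 2012-01-09 is carried forward.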

Finally, subtract one from the other.

> DT[,`7daysum`:=cumcnt-before7dayago] 
     id  date cnt cumcnt before7dayago 7daysum 
[1,] 1 2012-01-01 1  1   NA  NA 
[2,] 1 2012-01-02 2  3   NA  NA 
[3,] 1 2012-01-03 3  6   NA  NA 
[4,] 1 2012-01-04 0  6   NA  NA 
[5,] 1 2012-01-05 0  6   NA  NA 
[6,] 1 2012-01-06 4  10   NA  NA 
[7,] 1 2012-01-07 5  15   NA  NA 
[8,] 1 2012-01-08 6  21    1  20 
[9,] 1 2012-01-09 7  28    3  25 
[10,] 1 2012-01-20 8  36   28  8 
[11,] 2 2012-01-01 0  0   NA  NA 
[12,] 2 2012-01-02 1  1   NA  NA 
[13,] 2 2012-01-03 0  1   NA  NA 
[14,] 2 2012-01-04 1  2   NA  NA 
[15,] 2 2012-01-05 2  4   NA  NA 
[16,] 2 2012-01-06 3  7   NA  NA 
[17,] 2 2012-01-07 4  11   NA  NA 
[18,] 2 2012-01-08 5  16    0  16 
[19,] 2 2012-01-09 6  22    1  21 
[20,] 2 2012-01-20 7  29   22  7 
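Note that `7daysum` covers the current day plus the six before it, and is `NA` until an observation at least 7 days old exists, whereas the loop in the question sums only days `d-6` to `d-1` and returns partial sums at the start. A hedged sketch of one way to reproduce the loop's numbers with the same cumulative-sum idea follows. It assumes at most one row per id and date, treats a missing earlier total as 0, and the column name `prev6sum` is made up for the example.

```r
library(data.table)

DT <- data.table(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), times = 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)
setkey(DT, id, date)

DT[, cumcnt := cumsum(cnt), by = id]

# Running total as of 7 days earlier, rolled forward over gaps.
# NA means there was no earlier observation, so nothing to subtract.
before <- DT[DT[, .(id, date = date - 7)], cumcnt, roll = TRUE]
before[is.na(before)] <- 0

# Days d-6 .. d-1: remove the current day's cnt, then the older total.
DT[, prev6sum := cumcnt - cnt - before]
```

On the toy data this gives 14 for id 1 on 2012-01-08 (the loop's value), where `7daysum` gives 20 because it also counts that day's 6.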

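This should be very fast, since the row-by-row loop is replaced by one grouped cumulative sum and one keyed rolling join.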

+2

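Bravo! Thank you, this works great. It looks like I need to dig deeper into data.table. Although I've only just started using data.table, I wasn't aware of the `by` functionality. – Statwonk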