2013-08-24 23 views
1

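Count preceding rows that match a condition

I am working with time-series data and need to count the rows before the current row that match a condition. For example, I need to know for how many consecutive months, up to the row's month, the customer has had sales (NETSALES > 0). Ideally I would maintain a row counter that resets when the condition fails (e.g. NETSALES = 0).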

Another way to approach the problem would be to flag rows that have more than 12 preceding periods with NETSALES.

I am using

COUNT(*) 
OVER (PARTITION BY cust ORDER BY dt 
    ROWS 12 PRECEDING) as CtWindow, 

http://sqlfiddle.com/#!6/990eb/2

The example above comes closest: 201310 is correctly flagged as 12, but ideally the previous row would have been 11.

The solution can be in R or T-SQL.

Update with a data.table example:

library(data.table) 
set.seed(50) 
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20) 

The goal is to compute a column like "run" below, which is reset to zero whenever NETSALES is zero:

 NETSALES cust dt run 
1: 36.956464 1 1 1 
2: 83.767621 1 2 2 
3: 28.585003 1 3 3 
4: 10.250524 1 4 4 
5: 6.537188 1 5 5 
6: 0.000000 1 6 6 
7: 95.489944 1 7 7 
8: 46.351387 1 8 8 
9: 0.000000 1 9 0 
10: 0.000000 1 10 0 
11: 99.621881 1 11 1 
12: 76.755104 1 12 2 
13: 64.288721 1 13 3 
14: 0.000000 1 14 0 
15: 36.504473 1 15 1 
16: 43.157142 1 16 2 
17: 71.808349 1 17 3 
18: 53.039105 1 18 4 
19: 0.000000 1 19 0 
20: 27.387369 1 20 1 
21: 58.308899 2 1 1 
22: 65.929296 2 2 2 
23: 20.529473 2 3 3 
24: 58.970898 2 4 4 
25: 13.785201 2 5 5 
26: 4.796752 2 6 6 
27: 72.758112 2 7 7 
28: 7.088647 2 8 8 
29: 14.516362 2 9 9 
30: 94.470714 2 10 10 
31: 51.254178 2 11 11 
32: 99.544261 2 12 12 
33: 66.475412 2 13 13 
34: 8.362936 2 14 14 
35: 96.742115 2 15 15 
36: 15.677712 2 16 16 
37: 0.000000 2 17 0 
38: 95.684652 2 18 1 
39: 65.639292 2 19 2 
40: 95.721081 2 20 3 
    NETSALES cust dt run 

Answers

3

This seems to do it:

library(data.table) 
set.seed(50) 
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20) 
DT[,dir:=ifelse(NETSALES>0,1,0)] 
dir.rle <- rle(DT$dir) 
DT <- transform(DT, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths)) 
DT[,runl:=cumsum(dir),by=indexer] 

Credit to the question "Cumulative sums over run lengths. Can this loop be vectorized?"
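To make the run-length idea concrete, here is a minimal base R sketch on a toy vector. The values in `sales` are made up for illustration and are not taken from the question's data; `ave()` stands in for the grouped `cumsum` that the answer does with data.table.

```r
# Toy sales vector (hypothetical values, not from the question's data)
sales <- c(5, 3, 0, 0, 7, 2, 4, 0, 1)
pos <- sales > 0

# rle() collapses consecutive equal values into runs
r <- rle(pos)

# One id per run, repeated for each element of that run
indexer <- rep(seq_along(r$lengths), r$lengths)

# Cumulative count within each run; runs of FALSE sum to 0
run <- ave(as.integer(pos), indexer, FUN = cumsum)

print(indexer)  # 1 1 2 2 3 3 3 4 5
print(run)      # 1 2 0 0 1 2 3 0 1
```

Each change between "sale" and "no sale" starts a new run id, so the cumulative sum restarts there, which is the reset behaviour the question asks for.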


Edit by Roland:

Here is the same approach with better performance, which also takes the different customers into account:

#no need for ifelse 
DT[,dir:= NETSALES>0] 

#use a function to avoid storing the rle, which could be huge 
runseq <- function(x) { 
    x.rle <- rle(x) 
    rep(1:length(x.rle$lengths), x.rle$lengths) 
} 

#never use transform with data.table 
DT[,indexer := runseq(dir)] 

#include cust in by 
DT[,runl:=cumsum(dir),by=list(indexer,cust)] 
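As a possible shortcut, a reasonably recent data.table provides `rleid()`, which could replace the hand-written `runseq` helper. The sketch below is not part of the original answer; it uses made-up toy data, chosen so that customer 1 ends on a sale and customer 2 starts on one, to check that the counter also restarts at the customer boundary.

```r
library(data.table)

# Toy data (hypothetical): customer 1 ends on a sale, customer 2 starts on one
DT <- data.table(
  cust     = rep(1:2, each = 4),
  dt       = rep(1:4, times = 2),
  NETSALES = c(10, 0, 20, 30,   40, 50, 0, 60)
)

# Rows must be in time order within each customer for the runs to be meaningful
setorder(DT, cust, dt)

# rleid() numbers consecutive runs of equal values, so grouping by
# (cust, run id) restarts the cumulative count at every zero and at every
# customer boundary
DT[, runl := cumsum(NETSALES > 0), by = .(cust, rleid(NETSALES > 0))]

print(DT$runl)  # 1 0 1 2 1 2 0 1
```

A flag for long runs could then be derived from `runl` (for example `runl > 12`), depending on whether the current month should count toward the 12 periods.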

Edit: Joe added a SQL solution: http://sqlfiddle.com/#!6/990eb/22

The SQL solution took 48 minutes across 22 million rows on a machine with 128 GB of RAM. The R solution takes about 20 seconds on a workstation with 4 GB of RAM. Go R!

+2

Don't use `transform` with data.table: [reason](http://stackoverflow.com/q/18216658/1412059). – Roland

+0

Great edit and answer! Thanks – Joe
