2014-05-11 155 views
1

R - Rolling window in data.table

I have the following data.table:

  time  id type price  size api start.point end.point 
1: 1399672906 37119594 ASK 440.002 1.4840000 TRUE 1399672606 1399672906 
2: 1399672940 37119597 BID 441.000 0.1758830 TRUE 1399672640 1399672940 
3: 1399672940 37119598 BID 441.000 0.0491166 TRUE 1399672640 1399672940 
4: 1399673105 37119638 ASK 440.002 0.1313700 TRUE 1399672805 1399673105 
5: 1399673198 37119668 BID 441.000 0.0233013 TRUE 1399672898 1399673198 
6: 1399673198 37119669 BID 441.000 0.9744230 TRUE 1399672898 1399673198 
7: 1399673208 37119675 BID 441.000 0.1587060 TRUE 1399672908 1399673208 
8: 1399673208 37119676 BID 441.000 0.1238870 TRUE 1399672908 1399673208 
9: 1399673208 37119677 BID 441.001 0.0100000 TRUE 1399672908 1399673208 
10: 1399673208 37119678 BID 441.175 0.0129740 TRUE 1399672908 1399673208 
11: 1399673208 37119679 BID 441.192 0.0100000 TRUE 1399672908 1399673208 
12: 1399673208 37119680 BID 441.399 0.0129740 TRUE 1399672908 1399673208 
13: 1399673208 37119681 BID 441.499 1.7500000 TRUE 1399672908 1399673208 
14: 1399673208 37119682 BID 441.500 8.0214600 TRUE 1399672908 1399673208 
15: 1399673241 37119691 BID 441.500 0.0453001 TRUE 1399672941 1399673241 
16: 1399673274 37119696 ASK 440.030 0.9133460 TRUE 1399672974 1399673274 
17: 1399673360 37119705 BID 440.030 0.0580000 TRUE 1399673060 1399673360 
18: 1399673433 37119709 ASK 440.002 0.0319611 TRUE 1399673133 1399673433 
19: 1399673506 37119711 ASK 440.002 0.2618460 TRUE 1399673206 1399673506 
20: 1399673507 37119712 BID 440.002 1.0000000 TRUE 1399673207 1399673507 

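where:

  • time is a unix timestamp
  • id is the trade number assigned by the exchange
  • start.point = "time" minus 5 minutes
  • end.point = simply equal to the variable "time"

The series is not equally spaced. The variables start.point and end.point effectively create a 5-minute moving window that ends at the variable "time". I want to compute the frequency of trades within each window.

I have done this in a for loop: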
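To make the window definition concrete, here is a minimal sketch of how the two window columns can be derived from `time`. The six times and ids are the ASK rows from the sample further down; `window.secs` is a name introduced only for this illustration, assuming "5 minutes" means 300 seconds.

```r
library(data.table)

# Six trades taken from the sample data in this post.
trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)

window.secs <- 300L  # assumption: 5 minutes expressed in seconds

# Each window ends at the trade itself and starts 5 minutes earlier.
trades[, `:=`(start.point = time - window.secs, end.point = time)]
```

Every row then carries its own closed interval [start.point, end.point], and the goal is to count how many trade times fall inside that interval.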

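# status.bar is assumed to be a text progress bar created before the loop
status.bar <- txtProgressBar(min = 0, max = nrow(trades), style = 3)

for (i in 1:nrow(trades)) {
    # count the unique trade ids whose time falls inside window i
    trades[i, freq := length(unique(trades[time >= start.point[i] & time <= end.point[i]]$id))]
    setTxtProgressBar(status.bar, i)
}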

However, I wonder whether there is a more idiomatic data.table way of doing this. I was thinking of something like:

trades[, freq := list(length(unique(trades[time >= start.point & time <= end.point,]$id))), by = list(id)] 

But the results are wrong; it seems that this does not work on a row-by-row basis:

  time  id type price  size api start.point end.point freq 
    1: 1399672906 37119594 ASK 440.002 1.4840000 TRUE 1399672606 1399672906 100 
    2: 1399672940 37119597 BID 441.000 0.1758830 TRUE 1399672640 1399672940 100 
    3: 1399672940 37119598 BID 441.000 0.0491166 TRUE 1399672640 1399672940 100 
    4: 1399673105 37119638 ASK 440.002 0.1313700 TRUE 1399672805 1399673105 100 
    5: 1399673198 37119668 BID 441.000 0.0233013 TRUE 1399672898 1399673198 100 
    6: 1399673198 37119669 BID 441.000 0.9744230 TRUE 1399672898 1399673198 100 
    7: 1399673208 37119675 BID 441.000 0.1587060 TRUE 1399672908 1399673208 100 
    8: 1399673208 37119676 BID 441.000 0.1238870 TRUE 1399672908 1399673208 100 
    9: 1399673208 37119677 BID 441.001 0.0100000 TRUE 1399672908 1399673208 100 
10: 1399673208 37119678 BID 441.175 0.0129740 TRUE 1399672908 1399673208 100 
11: 1399673208 37119679 BID 441.192 0.0100000 TRUE 1399672908 1399673208 100 
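A likely explanation for the constant value: inside the inner `trades[...]`, the names `start.point` and `end.point` resolve to the full columns of `trades` itself rather than to the current group's values, so each row is only compared with its own window and every row passes. The count is then presumably just the total number of unique ids. A small sketch on a made-up four-row table (`dt`, `attempt` and `correct` are illustrative names, not from the post):

```r
library(data.table)

dt <- data.table(time = c(100L, 250L, 500L, 900L), id = 1:4)
dt[, `:=`(start.point = time - 300L, end.point = time)]

# The inner subset compares each row with its own window, so all rows pass.
inner <- dt[time >= start.point & time <= end.point]

# Same shape as the grouped attempt: every group gets the total id count.
attempt <- dt[, .(freq = length(unique(dt[time >= start.point & time <= end.point]$id))),
              by = id]

# What was intended: for window k, count the times that fall inside it.
correct <- vapply(seq_len(nrow(dt)), function(k) {
  sum(dt$time >= dt$start.point[k] & dt$time <= dt$end.point[k])
}, integer(1))
```

Here `attempt$freq` is the same for every id, while `correct` varies with the window, which mirrors the symptom shown above.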

UPDATE:

See the structure below:

structure(list(time = c(1399672906L, 1399673105L, 1399673274L, 
1399673433L, 1399673506L, 1399673531L), id = c(37119594L, 37119638L, 
37119696L, 37119709L, 37119711L, 37119717L), type = c("ASK", 
"ASK", "ASK", "ASK", "ASK", "ASK"), price = c(440.002, 440.002, 
440.03, 440.002, 440.002, 440), size = c(1.484, 0.13137, 0.913346, 
0.0319611, 0.261846, 3.168), api = c(TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE), start.point = c(1399672606, 1399672805, 1399672974, 
1399673133, 1399673206, 1399673231), end.point = c(1399672906L, 
1399673105L, 1399673274L, 1399673433L, 1399673506L, 1399673531L 
), freq = c(1L, 4L, 13L, 14L, 13L, 11L)), .Names = c("time", 
"id", "type", "price", "size", "api", "start.point", "end.point", 
"freq"), sorted = c("type", "time"), class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000002e50788>) 

Answer

4

I think this can be done with the Bioconductor package IRanges, until data.table implements interval joins / range joins.

require(IRanges) 
ir1 = IRanges(trades$time, width=1L) 
ir2 = IRanges(trades$start.point, trades$end.point) 

olaps = findOverlaps(ir1, ir2, type = "within") 
dt = data.table(queryHits(olaps), subjectHits(olaps))[, .N, by=V2] 

trades[dt$V2, freq := dt$N] 

#   time  id type price  size api start.point end.point freq 
# 1: 1399672906 37119594 ASK 440.002 1.4840000 TRUE 1399672606 1399672906 1 
# 2: 1399673105 37119638 ASK 440.002 0.1313700 TRUE 1399672805 1399673105 2 
# 3: 1399673274 37119696 ASK 440.030 0.9133460 TRUE 1399672974 1399673274 2 
# 4: 1399673433 37119709 ASK 440.002 0.0319611 TRUE 1399673133 1399673433 2 
# 5: 1399673506 37119711 ASK 440.002 0.2618460 TRUE 1399673206 1399673506 3 
# 6: 1399673531 37119717 ASK 440.000 3.1680000 TRUE 1399673231 1399673531 4 
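For what it is worth, data.table has since gained non-equi joins (in version 1.9.8, as far as I recall), so the same count can probably be obtained without IRanges. A minimal sketch on the six rows above, rebuilding the window columns as `time - 300L` and `time`; `counts` is a name introduced here for illustration:

```r
library(data.table)  # assumes a version with non-equi joins

trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)
trades[, `:=`(start.point = time - 300L, end.point = time)]

# Self-join: for each window (row of the inner `trades`), count the rows of
# the outer `trades` whose time lies in [start.point, end.point].
counts <- trades[trades,
                 on = .(time >= start.point, time <= end.point),
                 .N, by = .EACHI]

# by = .EACHI returns one row per window, in the original row order.
trades[, freq := counts$N]
```

This counts rows rather than unique ids, which is the same thing as long as each id appears once.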

HTH

+0

That is it. This is a magical solution! I am sure I would not have been able to find it without your help :)

+1

Just a note. You can do this more simply. Just use `counts <- IRanges::countOverlaps(ir1, ir2, type = "within")` to get the counts. Then add them to your data.table with `trades$freq <- counts`. Hope that helps.
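One thing worth checking with this shortcut is the direction of the count. As far as I know, `countOverlaps(query, subject)` returns one count per query range, so with `ir1` as the query it would count, for each trade, the windows that contain it, rather than the trades inside each window. Swapping the arguments (e.g. `countOverlaps(ir2, ir1)`) is my assumption for getting the per-window figure and should be verified against the IRanges documentation. A base R sketch of the two directions on the six sample rows (`per.window` and `per.trade` are illustrative names):

```r
time  <- c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L)
start <- time - 300L  # window start, 5 minutes before the trade
end   <- time         # window end, the trade itself

# Per window: how many trade times fall inside [start, end].
per.window <- vapply(seq_along(time), function(k) {
  sum(time >= start[k] & time <= end[k])
}, integer(1))

# Per trade: how many windows contain that trade time.
per.trade <- vapply(seq_along(time), function(k) {
  sum(start <= time[k] & end >= time[k])
}, integer(1))
```

On this sample `per.window` reproduces the `freq` column from the answer, while `per.trade` gives different numbers, so the two are not interchangeable.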