2016-03-25 80 views
1

Aggregating a data.table over intervals

I have a data.table with an amount column:

library(data.table)
n = 1e5
set.seed(1) 

dt <- data.table(id = 1:n, amount = pmax(0,rnorm(n, mean = 5e3, sd = 1e4))) 

and a given vector of breaks, such as:

breaks <- as.vector(c(0, t(sapply(c(1, 2.5, 5, 7.5), function(x) x * 10^(1:4))))) 

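For each interval defined by these breaks, I would like to use data.table syntax to:

  1. get the count of amount values it contains
  2. get the count of amount values equal to or greater than its left bound (basically n * (1 - cdf(amount)))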

For 1, this mostly works, but it returns no rows for the empty intervals:

dt[, .N, keyby = breaks[findInterval(amount,breaks)] ] #would prefer to get 0 for empty intvl 

For 2, I tried:

dt[, sum(amount >= thresh[.GRP]), keyby = breaks[findInterval(amount,breaks)] ] 

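but it didn't work, because the sum is restricted to the rows within each group and cannot see beyond them. So I came up with a workaround, which also returns the empty intervals: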

dt[, cbind(breaks, sapply(breaks, function(x) sum(amount >= x)))] # desired result 

So, is there a data.table way to solve my 2, and to get the empty intervals for both?

+0

See some of the questions about `foverlaps`; just a few: [1](http://stackoverflow.com/questions/25815032/finding-overlaps-between-interval-sets-efficient-overlap-joins), [2](http://stackoverflow.com/questions/28540466/how-to-identify-overlaps-in-multiple-columns), [3](http://stackoverflow.com/questions/34245295/), [4](http://stackoverflow.com/questions/27574775/is-it-possible-to-use-the-r-data-table-funcion-foverlaps-to-find-the-intersectio) – MichaelChirico

Answers

3

I would consider doing it this way:

mybreaks = c(-Inf, breaks, Inf) 
dt[, g := cut(amount, mybreaks)] 
dt[.(g = levels(g)), .N, on="g", by=.EACHI] 


                  g     N
 1:        (-Inf,0] 30976
 2:          (0,10]    23
 3:         (10,25]    62
 4:         (25,50]    73
 5:         (50,75]    85
 6:        (75,100]    88
 7:       (100,250]   503
 8:       (250,500]   859
 9:       (500,750]   916
10:     (750,1e+03]   912
11: (1e+03,2.5e+03]  5593
12: (2.5e+03,5e+03]  9884
13: (5e+03,7.5e+03]  9767
14: (7.5e+03,1e+04]  9474
15: (1e+04,2.5e+04] 28434
16: (2.5e+04,5e+04]  2351
17: (5e+04,7.5e+04]     0
18:  (7.5e+04, Inf]     0

You can use cumsum if you want the CDF.
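As a rough sketch of the cumsum idea applied to part 2 of the question (counts at or above each left bound), something like the following might work. It is not the answerer's code: the switch to left-closed bins with `right = FALSE`, the helper table `lv`, and the column names `n_ge` and `left` are all assumptions made for illustration.

```r
library(data.table)

# Same setup as in the question.
set.seed(1)
n <- 1e5
dt <- data.table(id = 1:n, amount = pmax(0, rnorm(n, mean = 5e3, sd = 1e4)))
breaks <- as.vector(c(0, t(sapply(c(1, 2.5, 5, 7.5), function(x) x * 10^(1:4)))))

# Assumption: left-closed bins [a, b), so that "amount >= left bound"
# lines up exactly with the bin edges (the answer uses right-closed bins).
dt[, g := cut(amount, c(breaks, Inf), right = FALSE)]

# One row per interval, including the empty ones, via a by = .EACHI join.
lv  <- data.table(g = levels(dt$g))   # hypothetical helper table
res <- dt[lv, .N, on = "g", by = .EACHI]

# Reverse cumulative sum: number of rows with amount >= each left bound.
res[, n_ge := rev(cumsum(rev(N)))]
res[, left := breaks]
```

Here `res$n_ge` should agree with the `sapply(breaks, function(x) sum(amount >= x))` workaround from the question, while `res$N` gives the per-interval counts with zeros for empty intervals.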

+0

The syntax in the third line is new to me, but I'll read up on it. Thanks for the help. – C8H10N4O2

+1

It's very new, introduced in version 1.9.6, and it hasn't been added to the vignettes yet. `on=` is just a way of doing an `X[Y]` merge, even when `X` is not keyed. @C8H10N4O2 – Frank
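To make the `on=` comment concrete, here is a minimal sketch of an `X[Y]` join on an unkeyed table. The tables `X` and `Y` and their columns `k` and `v` are made up for illustration and do not come from the thread.

```r
library(data.table)

# Toy tables (hypothetical); note that no key is set on X.
X <- data.table(k = c("a", "b", "b", "d"), v = 1:4)
Y <- data.table(k = c("a", "b", "c"))

# X[Y, on = "k"]: for each row of Y, look up the matching rows of X.
# Rows of Y without a match ("c") are kept, with NA in X's columns.
joined <- X[Y, on = "k"]

# by = .EACHI evaluates j once per row of Y, so unmatched rows get .N = 0.
counts <- X[Y, .N, on = "k", by = .EACHI]
```

The zero count for `"c"` is the same mechanism that keeps the empty intervals in the answer above.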