2016-08-17

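Can you explain this by-group data.table result?

I must be doing something obviously stupid here, but can someone explain why it looks like data.table is not performing the following operation by group?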

library(data.table)
set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

          10%    20%    30%    40%    50%    60%    70%    80%    90%
  -Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703    Inf

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)] 

              10%      20%      30%      40%      50%      60%      70%      80%      90%
    -Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762  0.06583  0.37427  0.69183  1.35196      Inf

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][] 

     grp     val        interval
  1:   a  0.2655   (-0.66,-0.55]  => this is a "b" interval ? I would expect (0.2555 0.3448]
  2:   a  0.3721  (-0.55,-0.322]
  3:   a  0.5729 (-0.118,0.0658]
  4:   a  0.9082     (1.35, Inf]
  5:   a  0.2017   (-1.23,-0.66]
 ---
196:   b -0.7508   (-1.23,-0.66]
197:   b  2.0872     (1.35, Inf]
198:   b  0.0174 (-0.118,0.0658]
199:   b -1.2863    (-Inf,-1.23]
200:   b -1.6406    (-Inf,-1.23]

What I usually want to do is something like this:

DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))] 
    grp        interval          V1
 1:   a (-0.321,0.0379]  0.01836077  => this is not a "a" interval
 2:   a   (0.0379,0.21]  0.13190935
 3:   a    (0.21,0.358]  0.29068707
 4:   a   (0.358,0.477]  0.41647597
 5:   a   (0.477,0.648]  0.55190648
 6:   a   (0.648,0.777]  0.70883795
 7:   a   (0.777,0.915]  0.84091210
 8:   a    (0.915, Inf]  0.95797615
 9:   b   (-Inf,-0.657] -1.23322909
10:   b (-0.657,-0.321] -0.53243898
11:   b (-0.321,0.0379] -0.13968720
12:   b   (0.0379,0.21]  0.11278201
13:   b    (0.21,0.358]  0.30783459
14:   b   (0.358,0.477]  0.40695489
15:   b   (0.477,0.648]  0.55976052
16:   b   (0.648,0.777]  0.70483170
17:   b   (0.777,0.915]  0.91017423
18:   b    (0.915, Inf]  1.57112705

which looks very much as if the intervals were defined on the whole dataset rather than per group:

DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)] 
                    10%         20%         30%         40%         50%         60%         70%         80%         90%
       -Inf -0.65729223 -0.32084835  0.03788176  0.20967534  0.35835115  0.47738589  0.64820328  0.77734560  0.91505885         Inf

Yes, everything in `by` or `keyby` is evaluated using the ungrouped vectors. Is that the confusion? – Frank

Hell yes... `DT[, , .(A, B)]` looks at all values of A and all values of B, and only then groups on each (A, B) pair. Not what I was expecting... pffff – statquant

Right, `DT[i, j, by]` reads as: subset by `i`, then group by `by`, then do `j` for each group. – Frank
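The reading order described in the comment above can be checked directly. The following is a minimal sketch, not part of the original thread: `decile_breaks` is a hypothetical helper name introduced only for illustration, and the data is rebuilt with the same seed as in the question. It compares the interval labels produced inside `keyby` with labels computed explicitly on the whole, ungrouped column.

```r
library(data.table)

set.seed(1)
DT <- data.table(grp = rep(c("a", "b"), each = 100),
                 val = c(runif(100), rnorm(100)))

# Hypothetical helper (not from the thread): decile breaks padded with -Inf/Inf.
decile_breaks <- function(x) c(-Inf, quantile(x, probs = seq(.1, .9, .1)), Inf)

# Inside `keyby`, `val` is the full 200-row column, so the breaks are global.
in_by <- DT[, .N, keyby = .(grp, interval = cut(val, decile_breaks(val)))]

# The same cut computed explicitly on the ungrouped column.
global <- cut(DT$val, decile_breaks(DT$val))

# Every label seen in the grouped result is one of the global labels.
all(as.character(in_by$interval) %in% levels(global))
```

If the last expression is TRUE, the expressions in `by`/`keyby` were evaluated once on the ungrouped data, which matches the output shown in the question.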

Answer

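It looks like you were expecting some fancy way of combining factor levels (which is what cut creates). Instead, you are seeing the kind of odd behavior that is typical of factors.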
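To make the point about factor levels concrete, here is a small base-R sketch. It is an illustration under assumptions, not a description of what data.table does internally: `decile_breaks` is a hypothetical helper, and the two vectors simply reuse the seed from the question. It shows that each group's `cut()` call returns a factor with its own set of levels, so the same integer code refers to different intervals in different groups.

```r
set.seed(1)
a <- runif(100)   # same draws as group "a" in the question
b <- rnorm(100)   # same draws as group "b"

# Hypothetical helper (not from the thread): decile breaks padded with -Inf/Inf.
decile_breaks <- function(x) c(-Inf, quantile(x, probs = seq(.1, .9, .1)), Inf)

fa <- cut(a, decile_breaks(a))
fb <- cut(b, decile_breaks(b))

# Each factor carries its own ten labels, so the level sets differ.
identical(levels(fa), levels(fb))

# The same integer code points at a different label in each factor.
levels(fa)[3]
levels(fb)[3]

# Character labels do not depend on a shared set of levels.
head(as.character(fa))
```

A factor is stored as integer codes plus a levels attribute, so codes from one group displayed with another group's levels would produce exactly the kind of mismatch seen in the question.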

I guess you could use strings:

DT[,interval := 
    as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf))) 
, by=grp] 

which gives

     grp         val        interval
  1:   a  0.26550866   (0.256,0.345]
  2:   a  0.37212390   (0.345,0.411]
  3:   a  0.57285336   (0.488,0.644]
  4:   a  0.90820779     (0.87, Inf]
  5:   a  0.20168193   (0.142,0.256]
 ---
196:   b -0.75081900   (-1.23,-0.66]
197:   b  2.08716655     (1.35, Inf]
198:   b  0.01739562 (-0.118,0.0658]
199:   b -1.28630053    (-Inf,-1.23]
200:   b -1.64060553    (-Inf,-1.23]

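These intervals are not good for much, though. If you try to sort by them, as in DT[, mean(val), keyby=.(grp, interval)], you will see that they come out of order, because the labels are now sorted as plain strings.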
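If the ordering matters more than the printed labels, one possible workaround is to store integer bin codes instead of strings. This is an assumption of this sketch rather than something proposed in the answer: it relies on the documented `labels = FALSE` argument of `cut()`, which returns integer codes, and it uses a hypothetical helper `decile_breaks` and a hypothetical column name `decile`.

```r
library(data.table)

set.seed(1)
DT <- data.table(grp = rep(c("a", "b"), each = 100),
                 val = c(runif(100), rnorm(100)))

# Hypothetical helper (not from the thread): decile breaks padded with -Inf/Inf.
decile_breaks <- function(x) c(-Inf, quantile(x, probs = seq(.1, .9, .1)), Inf)

# labels = FALSE makes cut() return integer codes 1..10 instead of a factor,
# so there are no per-group levels to reconcile.
DT[, decile := cut(val, decile_breaks(val), labels = FALSE), by = grp]

# Integer codes sort numerically, so keyby keeps the bins in order within each group.
res <- DT[, .(mean_val = mean(val)), keyby = .(grp, decile)]
res
```

The trade-off is that the interval endpoints are no longer visible in the result; they would have to be looked up separately from the per-group breaks.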


If you just want to make these cuts for a single computation...

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf)) 

DT[,{ 
    .SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)] 
},keyby=grp] 

which gives

    grp        interval          V1
 1:   a    (-Inf,0.142]  0.07670249
 2:   a   (0.142,0.256]  0.20584852
 3:   a   (0.256,0.345]  0.30715649
 4:   a   (0.345,0.411]  0.38583465
 5:   a   (0.411,0.488]  0.45901975
 6:   a   (0.488,0.644]  0.56413855
 7:   a   (0.644,0.714]  0.67442643
 8:   a   (0.714,0.784]  0.75834958
 9:   a    (0.784,0.87]  0.82747749
10:   a     (0.87, Inf]  0.91951669
11:   b    (-Inf,-1.23] -1.54198329
12:   b   (-1.23,-0.66] -0.92447488
13:   b   (-0.66,-0.55] -0.61458549
14:   b  (-0.55,-0.322] -0.45029247
15:   b (-0.322,-0.118] -0.22533466
16:   b (-0.118,0.0658] -0.01587467
17:   b  (0.0658,0.374]  0.24836075
18:   b   (0.374,0.692]  0.53061032
19:   b    (0.692,1.35]  1.01688411
20:   b     (1.35, Inf]  1.80089535

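Yeah, not very elegant, but I think this is a problem that comes from R itself, and it is not clear how it should change to solve your problem.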
