2013-11-24 30 views
1

我想重新編寫下面稍微複雜的plyr命令,以便它更快,並使用聚合,tapply或data.table。用集合重寫複雜的ddply

該功能允許您輸入多個ID變量並測量變量,然後返回幾個計算結果。但是,在更大的數據集上,它可能不是最有效的。


這裏是代碼...

require(ggplot2) # to get the diamonds data set 
require(plyr) 

mean_sd_for_several_variables <- function(df, lvls, measures) { 
    res <- ddply(df, lvls, function(x) { 
       ret <- vector() 
       for(measure in measures) { 
        mean_sd <- c(mean(x[,measure]), sd(x[,measure])) 
        names(mean_sd) <- c(paste0("mean_", measure), paste0("sd_", measure)) 
        ret <- c(ret, mean_sd) 
       } 
       return(ret) 
       } 
) 

    print(res) 
} 

...返回:

mean_sd_for_several_variables(diamonds, c("color", "cut"), c("price","depth")) 

    color  cut mean_price sd_price mean_depth sd_depth 
1  D  Fair  4291.1 3286.1  64.048 3.29220 
2  D  Good  3405.4 3175.1  62.366 2.22240 
3  D Very Good  3470.5 3523.8  61.750 1.46223 
4  D Premium  3631.3 3711.6  61.169 1.15806 
5  D  Ideal  2629.1 3001.1  61.678 0.71201 
6  E  Fair  3682.3 2976.7  63.320 4.42103 
7  E  Good  3423.6 3330.7  62.204 2.23059 
8  E Very Good  3214.7 3408.0  61.730 1.42377 
9  E Premium  3538.9 3795.0  61.176 1.16454 
10  E  Ideal  2597.6 2956.0  61.687 0.70718 
11  F  Fair  3827.0 3223.3  63.508 3.70209 
12  F  Good  3495.8 3202.4  62.202 2.23976 
13  F Very Good  3778.8 3786.1  61.722 1.38939 
14  F Premium  4324.9 4012.0  61.260 1.16775 
15  F  Ideal  3374.9 3766.6  61.676 0.69398 
16  G  Fair  4239.3 3609.6  64.340 3.57340 
17  G  Good  4123.5 3702.5  62.527 2.03893 
18  G Very Good  3872.8 3861.4  61.841 1.33169 
19  G Premium  4500.7 4356.6  61.279 1.15341 
20  G  Ideal  3720.7 4006.3  61.700 0.68714 
21  H  Fair  5135.7 3886.5  64.585 3.14173 
22  H  Good  4276.3 4020.7  62.500 2.09212 
23  H Very Good  4535.4 4185.8  61.968 1.31895 
24  H Premium  5216.7 4466.2  61.322 1.15164 
25  H  Ideal  3889.3 4013.4  61.733 0.72939 
26  I  Fair  4685.4 3730.3  64.221 3.68771 
27  I  Good  5078.5 4631.7  62.475 2.17958 
28  I Very Good  5255.9 4687.1  61.935 1.32890 
29  I Premium  5946.2 5053.7  61.329 1.15338 
30  I  Ideal  4452.0 4505.2  61.794 0.72334 
31  J  Fair  4975.7 4050.5  64.357 3.31595 
32  J  Good  4574.2 3707.8  62.396 2.12091 
33  J Very Good  5103.5 4135.7  61.902 1.33679 
34  J Premium  6294.6 4788.9  61.390 1.13989 
35  J  Ideal  4918.2 4476.2  61.822 0.94669 
+0

嗯,沒有太多的時間正確的認識,只是覺得我做了我的gateveys包類似的東西。 'calcShares'函數在這裏:https://github.com/mbannert/gateveys/blob/master/R/gateveys.R以類似的方式使用data.table。不知道這是否對你更直觀,但至少是data.table。 –

回答

4

使用aggregate

> result <- aggregate(cbind(price, depth) ~ color+cut, 
        FUN=function(x) c(mean=mean(x), sd=sd(x)), 
        data=diamonds) 
> do.call(data.frame, result) 
    color  cut price.mean price.sd depth.mean depth.sd 
1  D  Fair 4291.061 3286.114 64.0484663 3.2921972 
2  E  Fair 3682.312 2976.652 63.3196429 4.4210329 
3  F  Fair 3827.003 3223.303 63.5080128 3.7020938 
4  G  Fair 4239.255 3609.644 64.3398089 3.5733985 
5  H  Fair 5135.683 3886.482 64.5851485 3.1417311 
+0

嗯..我得到以下錯誤:錯誤在FUN(X [[1L]],...):數組元素不匹配 –

+0

再試一次,我沒有得到任何錯誤。 –

+0

抱歉。這是一個名爲memisc的軟件包。哎呀! –

4

這裏是data.table解決

mean_and_sd <- function(.SD){ 
    x1 = lapply(.SD, mean) 
    x2 = lapply(.SD, sd) 
    cbind(x1, x2) 
} 

library(data.table) 
DT = data.table(diamonds) 
DT[, mean_and_sd(.SD), by = c("cut", "color"), .SDcols = c("price", "carat")] 

你可以拋出此變成一個接受d的函數預期的輸入並返回適當的數據幀。

+0

請注意,'.SD'針對'j'中的lapply(.SD,...)進行了優化,因此隨着'by'中組的數量的增加,這將會變慢。我可以想到的另一種方式是:DT [,unlist(lapply(.SD,function(x)list(mean = mean(x),sd = sd(x))),rec = FALSE)by = list (剪切,顏色),.SDcols = c(「價格」,「克拉」)](隨着組數的增加,時間差異將更加顯着)。 – Arun

+1

謝謝@阿倫。我在'lapply'裏面如何添加'mean'和'sd'。這個選擇聽起來很棒。 – Ramnath