2014-12-11 43 views
5

R ddply loop; multiple factors

I want to use ddply to summarise data from several variables by multiple factors.

I have the following test data:

site block plot rep name weight height dtf 
Alberta 1 2 1 A 43 139 54 
Alberta 2 5 2 A 46 139 46 
Alberta 4 10 3 A 49 136 54 
Nunavut 1 1 1 A 49 136 59 
Nunavut 2 4 2 A 51 135 50 
Nunavut 3 8 3 A 52 133 56 
Alberta 5 13 1 B 55 132 50 
Alberta 4 12 2 B 55 125 46 
Alberta 5 15 3 B 56 120 46 
Nunavut 5 14 1 B 57 119 54 
Nunavut 5 13 2 B 58 119 55 
Nunavut 4 11 3 B 59 118 51 
... 

and so on.

I want to summarise the variables "weight", "height" and "dtf" by the factors "site" and "name".

I started with vectors of column names:

data.factors <- NULL 
data.variables <- NULL 
for(n in 1:length(data)){if(is.factor(data[[n]])){ data.factors <- c(data.factors,colnames(data[n]))} else next} 
for(n in 1:length(data)){if(is.numeric(data[[n]]) || is.integer(data[[n]])){ data.variables <- c(data.variables,colnames(data[n]))} else next} 
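
The same two vectors can also be built without explicit loops. This is a minimal sketch, assuming `data` is a data frame whose grouping columns are stored as factors (since R 4.0.0, `data.frame()` no longer converts strings to factors by default, so they are created explicitly here). The two-row data frame is a made-up stand-in for the real data.

```r
# Hypothetical two-row stand-in for the real data frame
data <- data.frame(
  site   = factor(c("Alberta", "Nunavut")),
  name   = factor(c("A", "B")),
  weight = c(43, 49),
  height = c(139L, 136L)
)

# sapply() applies the predicate to every column and returns a logical vector,
# which is then used to pick out the matching column names
data.factors   <- names(data)[sapply(data, is.factor)]
data.variables <- names(data)[sapply(data, is.numeric)]  # is.numeric() is also TRUE for integers
```
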

This works for running multiple one-way ANOVAs:

for(variables in data.variables){ 
for(factors in data.factors){ 
output1 <- aov(lm(data[[variables]]~data[[factors]])) 
cat(variables) 
cat(" by ") 
cat(factors) 
cat("\n") 
print(summary(output1)) 
}} 

But I can't get it to work with ddply.

for (x in data.variables){ 
variable.summary <- ddply(data, .(site,name), summarise, 
N = sum(!is.na(x[1])), 
min = min(x[1], na.rm=TRUE), 
max = max(x[1], na.rm=TRUE), 
mean = mean(x[1], na.rm=TRUE), 
sd = sd(x[1], na.rm=TRUE), 
se = sd/sqrt(N) 
) 
print(variable.summary) 
} 

What I get is this:

site name N min max mean sd se 
1 Alberta A 1 weight weight NA NA NA 
2 Alberta B 1 weight weight NA NA NA 
3 Alberta C 1 weight weight NA NA NA 
4 Alberta D 1 weight weight NA NA NA 
5 Alberta E 1 weight weight NA NA NA 
6 Nunavut A 1 weight weight NA NA NA 
7 Nunavut B 1 weight weight NA NA NA 
8 Nunavut C 1 weight weight NA NA NA 
9 Nunavut D 1 weight weight NA NA NA 
10 Nunavut E 1 weight weight NA NA NA 
.... 

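If I test ddply with a single variable (named directly, rather than referenced through "x"), it works fine.

Is there a trick to get the function to recognise a referenced column ID? I'm used to Perl, with its $scalars that can be referenced anywhere, and would like a similar system in R.

Answers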
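
The output above is consistent with `x` holding a column name as a character string: `min("weight")` simply returns `"weight"`, and `mean()` of a string is `NA`. One possible way around this, sketched here under the assumption that the plyr package is used as in the question, is to give ddply an anonymous function and look the column up by name with `d[[x]]`. The `summaries` list and the reduced data frame are illustrative, not part of the original code.

```r
library(plyr)

# The twelve sample rows from the question (only the columns used here)
data <- data.frame(
  site   = rep(rep(c("Alberta", "Nunavut"), each = 3), times = 2),
  name   = rep(c("A", "B"), each = 6),
  weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59),
  height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118),
  dtf    = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51)
)
data.variables <- c("weight", "height", "dtf")

summaries <- list()  # hypothetical container: one summary table per variable
for (x in data.variables) {
  summaries[[x]] <- ddply(data, .(site, name), function(d) {
    v <- d[[x]]              # x is a string, so index the group's rows by name
    N <- sum(!is.na(v))      # number of non-missing values in this group
    s <- sd(v, na.rm = TRUE)
    data.frame(N    = N,
               min  = min(v, na.rm = TRUE),
               max  = max(v, na.rm = TRUE),
               mean = mean(v, na.rm = TRUE),
               sd   = s,
               se   = s / sqrt(N))
  })
}

print(summaries[["weight"]])
```
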

0

You can try data.table:

> testdt = data.table(testdf) 
> testdt[,list(meanwt=mean(weight),meanht=mean(height)),by=list(site,name)] 
     site name meanwt meanht 
1: Alberta A 46.00000 138.0000 
2: Nunavut A 50.66667 134.6667 
3: Alberta B 55.33333 125.6667 
4: Nunavut B 58.00000 118.6667 

Max, min, etc. can be added to the list of functions.
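
A rough sketch of what adding more functions could look like, assuming `testdt` is rebuilt from the twelve sample rows in the question. The output column names (`n`, `minwt`, `maxwt`, `sdwt`) and the `vars` vector are made up for illustration. The second call uses `.SDcols` to select columns by name, which avoids writing each variable out by hand.

```r
library(data.table)

# The twelve sample rows from the question (only the columns used here)
testdt <- data.table(
  site   = rep(rep(c("Alberta", "Nunavut"), each = 3), times = 2),
  name   = rep(c("A", "B"), each = 6),
  weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59),
  height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118),
  dtf    = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51)
)

# Several summary statistics for one variable, per site/name group
res <- testdt[, list(n      = .N,
                     minwt  = min(weight),
                     maxwt  = max(weight),
                     meanwt = mean(weight),
                     sdwt   = sd(weight)),
              by = list(site, name)]

# One statistic for several variables, with the columns chosen by name
vars  <- c("weight", "height", "dtf")
means <- testdt[, lapply(.SD, mean), by = list(site, name), .SDcols = vars]
```
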

3

dplyr, the successor to plyr (the package that provides ddply), makes this really easy with group_by() and summarise_each(), with no loops at all:

df <- data.frame(site = c("Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut", "Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut"), 
       block = c(1, 2, 4, 1, 2, 3, 5, 4, 5, 5, 5, 4), 
       plot = c(2, 5, 10, 1, 4, 8, 13, 12, 15, 14, 13, 11), 
       rep = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), 
       name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), 
       weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59), 
       height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118), 
       dtf = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51)) 

library(dplyr) 

df.summary <- df %>% 
    group_by(site, name) %>% 
    summarise_each(funs(length, min, max, mean, sd), weight, height, dtf) 

This results in a data frame like this:

> df.summary 
Source: local data frame [4 x 17] 
Groups: site 

    site name weight_length height_length dtf_length weight_min height_min dtf_min 
1 Alberta A    3    3   3   43  136  46 
2 Alberta B    3    3   3   55  120  46 
3 Nunavut A    3    3   3   49  133  50 
4 Nunavut B    3    3   3   57  118  51 
Variables not shown: weight_max (dbl), height_max (dbl), dtf_max (dbl), weight_mean (dbl), 
    height_mean (dbl), dtf_mean (dbl), weight_sd (dbl), height_sd (dbl), dtf_sd (dbl) 

You can pass any function you like to funs() inside summarise_each, so if you want a standard error column, just define that function first:

se <- function(x) { 
    N <- sum(!is.na(x))                    # count all non-missing values, not just x[1]
    return(sd(x, na.rm = TRUE)/sqrt(N))    # sd must be called on x
} 

and pass it in: summarise_each(funs(length, min, max, mean, sd, se), ...)
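
In dplyr 1.0.0 and later, summarise_each() and funs() are deprecated in favour of summarise() combined with across(). A rough equivalent of the summary above is sketched here, assuming a standard error defined as the standard deviation divided by the square root of the number of non-missing values. The reduced data frame and the statistic names (`n`, `min`, and so on) are chosen for illustration.

```r
library(dplyr)

# The twelve sample rows from the question (only the columns used here)
df <- data.frame(
  site   = rep(rep(c("Alberta", "Nunavut"), each = 3), times = 2),
  name   = rep(c("A", "B"), each = 6),
  weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59),
  height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118),
  dtf    = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51)
)

# Standard error over the non-missing values of x
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

df.summary <- df %>%
  group_by(site, name) %>%
  summarise(
    # across() applies every function in the list to every selected column;
    # result columns are named <column>_<function>, e.g. weight_mean
    across(c(weight, height, dtf),
           list(n    = ~ sum(!is.na(.x)),
                min  = ~ min(.x, na.rm = TRUE),
                max  = ~ max(.x, na.rm = TRUE),
                mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE),
                se   = se)),
    .groups = "drop"
  )
```
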