2013-05-29 39 views
3

具有分類變量catA,catB和catC的示例數據框。 Obs是一些觀測值。將函數應用於分類變量的所有可能組合的子集的數據框

catA <- rep(factor(c("a","b","c")), length.out=100) 
catB <- rep(factor(1:4), length.out=100) 
catC <- rep(factor(c("d","e","f")), length.out=100) 
obs <- runif(100,0,100) 
dat <- data.frame(catA, catB, catC, obs) 

通過分類變量的所有可能的數據子集。現在

allsubs <- expand.grid(catA = c(NA,levels(catA)), catB = c(NA,levels(catB)), 
    catC = c(NA,levels(catC))) 
> head(allsubs, n=10) 
    catA catB catC 
1 <NA> <NA> <NA> 
2  a <NA> <NA> 
3  b <NA> <NA> 
4  c <NA> <NA> 
5 <NA> 1 <NA> 
6  a 1 <NA> 
7  b 1 <NA> 
8  c 1 <NA> 
9 <NA> 2 <NA> 
10 a 2 <NA> 

,什麼是創建與含有從施加到的DAT相應的子集(由貓變量的組合在每行中所定義的)的函數的結果的結果列的輸出數據幀的最簡單的方法。所以輸出應該看起來像下面的數據框'whatiwant',其中結果列將包含應用於每個子集的函數的結果。

> whatiwant 
    catA catB catC results 
1 <NA> <NA> <NA>  * 
2  a <NA> <NA>  * 
3  b <NA> <NA>  * 
4  c <NA> <NA>  * 
5 <NA> 1 <NA>  * 
6  a 1 <NA>  * 
7  b 1 <NA>  * 
8  c 1 <NA>  * 
9 <NA> 2 <NA>  * 
10 a 2 <NA>  * 

所以,如果應用的功能是 '平均',結果應該是:

dat$results[1] = mean(subset(dat,)$obs) 
dat$results[2] = mean(subset(dat, catA=="a")$obs) 

等等,等等。

回答

1

這是不乾淨的解決方案,但我認爲它接近你想要的。

getAllSubs <- function(df, lookup, fun) { 

    out <- lapply(1:nrow(lookup), function(i) { 

    df_new <- df 

    if(length(na.omit(unlist(lookup[i,]))) > 0) { 

     for(j in colnames(lookup)[which(!is.na(unlist(lookup[i,])))]) { 
     df_new <- df_new[df_new[,j] == lookup[i,j],] 
     } 
    } 
    fun(df_new) 
    }) 

    if(mean(sapply(out, length) ==1) == 1) { 
    out <- unlist(out) 
    } else { 
    out <- do.call("rbind", out) 
    } 

    final <- cbind(lookup, out) 
    final[is.na(final)] <- NA 
    final 
} 

由於它目前已被寫入,您必須事先構建查找表,但您可以輕鬆地將該構造移入該函數本身。我在最後添加了幾行,以確保它可以適應不同長度的輸出,所以NaN變成了NAs,只是因爲這似乎創造了一個更清潔的輸出。正如它目前所寫,在所有列爲NA的情況下,它將該功能應用於整個原始數據幀。

dat_out <- getAllSubs(dat, allsubs, function(x) mean(x$obs, na.rm = TRUE)) 

head(dat_out,20) 

    catA catB catC  out 
1 <NA> <NA> <NA> 47.25446 
2  a <NA> <NA> 51.54226 
3  b <NA> <NA> 46.45352 
4  c <NA> <NA> 43.63767 
5 <NA> 1 <NA> 47.23872 
6  a 1 <NA> 66.59281 
7  b 1 <NA> 32.03513 
8  c 1 <NA> 40.66896 
9 <NA> 2 <NA> 45.16588 
10 a 2 <NA> 50.59323 
11 b 2 <NA> 51.02013 
12 c 2 <NA> 33.15251 
13 <NA> 3 <NA> 51.67809 
14 a 3 <NA> 48.13645 
15 b 3 <NA> 57.92084 
16 c 3 <NA> 49.27710 
17 <NA> 4 <NA> 44.93515 
18 a 4 <NA> 40.36266 
19 b 4 <NA> 44.26717 
20 c 4 <NA> 50.74718 
3
ans <- with(dat, tapply(obs, list(catA, catB, catC), mean)) 
ans <- data.frame(expand.grid(dimnames(ans)), results=c(ans)) 
names(ans)[1:3] <- names(dat)[1:3] 

str(ans) 
# 'data.frame': 36 obs. of 4 variables: 
# $ catA : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ... 
# $ catB : Factor w/ 4 levels "1","2","3","4": 1 1 1 2 2 2 3 3 3 4 ... 
# $ catC : Factor w/ 3 levels "d","e","f": 1 1 1 1 1 1 1 1 1 1 ... 
# $ results: num 69.7 NA NA 55.3 NA ... 
2

另一種方法,一個函數獲取所有變量組合,另一個函數獲取所有子集的函數。該組合功能是從其他崗位被盜......

## return all combinations of vector up to maximum length n 
multicombn <- function(dat, n) { 
    unlist(lapply(1:n, function(x) combn(dat, x, simplify=F)), recursive=F) 
} 

對於allsubs,乏是形式c("catA","catB","catC"), out.name = "mean". FUNC需要形式ddply會採取寫的,這之間

func=function(x) mean(x$obs, na.rm=TRUE) 

library(plyr) 
allsubs <- function(indat, vars, func=NULL, out.name=NULL) { 
    results <- data.frame() 
    nvars <- rev(multicombn(vars,length(vars))) 
    for(i in 1:length(nvars)) { 
     results <- 
      rbind.fill(results, ddply(indat, unlist(nvars[i]), func)) 
    } 
    if(!missing(out.name)) names(results)[length(vars)+1] <- out.name 
    results 
} 

一個區別答案和shwaund的,這不會返回空行 子集,所以在結果列中沒有NAs。

allsubs(dat, c("catA","catB","catc"), func, out.name="mean") 
> head(allsubs(dat, vars, func, out.name = "mean"),20) 
    catA catB catC  mean 
1  a 1 d 56.65909 
2  a 2 d 54.98116 
3  a 3 d 37.52655 
4  a 4 d 58.29034 
5  b 1 e 52.88945 
6  b 2 e 50.43122 
7  b 3 e 52.57115 
8  b 4 e 59.45348 
9  c 1 f 52.41637 
10 c 2 f 34.58122 
11 c 3 f 46.80256 
12 c 4 f 51.58668 
13 <NA> 1 d 56.65909 
14 <NA> 1 e 52.88945 
15 <NA> 1 f 52.41637 
16 <NA> 2 d 54.98116 
17 <NA> 2 e 50.43122 
18 <NA> 2 f 34.58122 
19 <NA> 3 d 37.52655 
20 <NA> 3 e 52.57115 
1

只使用矢量化功能和基礎R

# Find all possible subsets of your data 
combVars <- c("catA", "catB", "catC") 
subsets <- lapply(0:length(combVars), combn, x = combVars, simplify = FALSE) 
subsets <- do.call(c, subsets) 
# Calculate means by each subset 
meanValues <- lapply(subsets, function(x) aggregate(dat[["obs"]], by = dat[x], FUN = mean)) 
# Pull them all into one dataframe 
Reduce(function(x,y) merge(x,y,all=TRUE), meanValues) 
+0

偉大的答案。我已經調整它爲data.tables [這裏](https://stackoverflow.com/a/45341665/4241780)。 – JWilliman

相關問題