2016-06-28
2

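I want to count the individual and joint occurrences of variables (1 means present, 0 means absent). This can be obtained with repeated calls to the table function (see the MWE below). I would greatly appreciate a more efficient way to get the required output given below. Thanks.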

set.seed(12345) 
A <- rbinom(n = 100, size = 1, prob = 0.5) 
B <- rbinom(n = 100, size = 1, prob = 0.6) 
C <- rbinom(n = 100, size = 1, prob = 0.7) 
df <- data.frame(A, B, C) 

table(A) 
A 
0 1 
48 52 

table(B) 
B 
0 1 
53 47 

table(C) 
C 
0 1 
34 66 

table(A, B) 
   B
A    0  1
  0 25 23
  1 28 24

table(A, C) 
   C
A    0  1
  0 12 36
  1 22 30

table(B, C) 
   C
B    0  1
  0 21 32
  1 13 34

table(A, B, C) 
, , C = 0

   B
A    0  1
  0  8  4
  1 13  9

, , C = 1

   B
A    0  1
  0 17 19
  1 15 15

Required output

I need something like the following:

A = 52 
B = 47 
C = 66 
A + B = 24 
A + C = 30 
B + C = 34 
A + B + C = 15 

How exactly should the output be structured? Also, for many of the counts above, 'crossprod(as.matrix(df))' would work –
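
To make the crossprod suggestion concrete, here is a minimal sketch on a small hand-made 0/1 data frame (the toy data and the name co are made up for illustration and are not from the question). For 0/1 indicator columns, the diagonal of the cross-product holds the individual counts and each off-diagonal cell holds a pairwise joint count. Three-way and higher counts are not covered by it.

```r
# Toy 0/1 indicators (illustrative data, not the question's df)
toy <- data.frame(A = c(1, 1, 0, 1),
                  B = c(1, 0, 1, 1),
                  C = c(0, 1, 1, 1))

# t(X) %*% X: cell [i, j] sums the products of columns i and j,
# i.e. the number of rows where both indicators are 1
co <- crossprod(as.matrix(toy))

diag(co)      # individual counts for A, B, C
co["A", "B"]  # rows where A and B are both 1
```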

So you don't want to count 'A' separately from 'AB'? – TARehman

Yes, you are correct @TARehman – MYaseen208

Answers

1

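Expanding on Sumedh's answer, you can also do this dynamically, without having to specify the filter each time. This is very useful if you have more than three columns to combine.

You can do something like this: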

library(dplyr) 

# Note: filter_() is deprecated as of dplyr 0.7.0 
lapply(seq_len(ncol(df)), function(i){ 
    # Generate all combinations of i columns 
    tmp_i = utils::combn(names(df), i) 
    # Each column of tmp_i holds the elements of one combination 
    apply(tmp_i, 2, function(x){ 
        # Build a filter such as ~ A == 1 & B == 1 
        dynamic_formula = as.formula(paste("~", paste(x, "== 1", collapse = " & "))) 
        df %>% 
            filter_(.dots = dynamic_formula) %>% 
            summarize(Count = n()) %>% 
            mutate(type = paste0(sort(x), collapse = "")) 
    }) %>% 
        bind_rows() 
}) %>% 
    bind_rows() 

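This will:

1) Generate all combinations of the columns of df: first the combinations of one element (A, B, C), then of two elements (AB, AC, BC), and so on. This is the outer lapply.

2) For each combination, build a dynamic formula. For AB, for example, the formula is A == 1 & B == 1, as Sumedh suggested. This is the dynamic_formula bit.

3) Filter the data frame with the dynamically generated formula and count the rows.

4) Bind everything together (the two bind_rows calls).

The output will be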

  Count type
1    52    A
2    47    B
3    66    C
4    24   AB
5    30   AC
6    34   BC
7    15  ABC
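
A note for readers on a current dplyr, where filter_() is deprecated: the same idea can be sketched with if_all(). This assumes dplyr 1.0.4 or later, and the function name count_joint and the toy data are made up here for illustration. It is not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: joint counts for every combination of 0/1 columns
count_joint <- function(df) {
  bind_rows(lapply(seq_len(ncol(df)), function(i) {
    # All combinations of i column names, as a list of character vectors
    combos <- utils::combn(names(df), i, simplify = FALSE)
    bind_rows(lapply(combos, function(x) {
      df %>%
        # Keep rows where every column in this combination equals 1
        filter(if_all(all_of(x), ~ .x == 1)) %>%
        summarise(Count = n()) %>%
        mutate(type = paste0(x, collapse = ""))
    }))
  }))
}

# Toy 0/1 indicators (illustrative data, not the question's df)
toy <- data.frame(A = c(1, 1, 0, 1),
                  B = c(1, 0, 1, 1),
                  C = c(0, 1, 1, 1))
res <- count_joint(toy)
res
```

Calling count_joint(df) on the data frame from the question should give the same kind of table as above, with one row per combination.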
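
Thanks @Lorenzo for the useful answer. I would appreciate it if you could explain _This is very useful if you have more than three columns to combine._ – MYaseen208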

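
I mean that the solution you would use to combine the 3 columns A, B, C of a data frame works in exactly the same way for 4, 5, or 6 columns. So if you also add D <- rbinom(n = 100, size = 1, prob = 0.5), E <- rbinom(n = 100, size = 1, prob = 0.6) and so on, it will still work and count all the combinations –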

0

Using dplyr
Occurrences of A only:

library(dplyr) 
df %>% filter(A == 1) %>% summarise(Total = nrow(.)) 

Occurrences of A and B:

df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.)) 

Occurrences of A, B, and C:

df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.)) 
1

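Edited to add: I see now that you do not want exclusive counts (i.e., A and AB should both include all the As).

I got a little nerd-sniped by this today, particularly because I wanted to solve it with base R and no packages. The code below should do it.

There is a very simple (in principle) solution that just uses xtabs(), which I have illustrated below. However, generalizing it to any possible number of dimensions, and then applying it to the various combinations, was actually harder. I strove to avoid using the dreaded eval(parse()).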

set.seed(12345) 
A <- rbinom(n = 100, size = 1, prob = 0.5) 
B <- rbinom(n = 100, size = 1, prob = 0.6) 
C <- rbinom(n = 100, size = 1, prob = 0.7) 
df <- data.frame(A, B, C) 

# Turn strings off 
options(stringsAsFactors = FALSE) 

# Obtain the n-way frequency table 
# This table can be directly subset using [] 
# It is a little tricky to pass the arguments 
# I'm trying to avoid eval(parse()) 
# But still give a solution that isn't bound to a specific size 
xtab_freq <- xtabs(formula = formula(x = paste("~", paste(names(df), collapse = " + "))), 
                   data = df) 

# Demonstrating what I mean 
# All A 
sum(xtab_freq["1",,]) 
# [1] 52 

# AC 
sum(xtab_freq["1",,"1"]) 
# [1] 30 

# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3 
# The output of combn() goes through list(), then is unlisted with recursive FALSE 
# This gives us a list of vectors 
# Each one being a combination in which we are interested 
lst_combs <- unlist(lapply(X = 1:3,FUN = combn,x = names(df),list),recursive = FALSE) 

# For nice output naming, I just paste the values together 
names(lst_combs) <- sapply(X = lst_combs,FUN = paste,collapse = "") 

# This is a function I put together 
# Generalizes process of extracting values from a crosstab 
# It does it in this fashion to avoid eval(parse()) 
uFunc_GetMargins <- function(crosstab,varvector,success) { 

    # Obtain the dimnames (the level names within each dimension) 
    # From that, get the names of the dimensions themselves 
    xtab_dnn <- dimnames(crosstab) 
    xtab_dn <- names(xtab_dnn) 

    # Use match() to get a numeric vector for the margins 
    # This can be used in margin.table() 
    tgt_margins <- match(x = varvector,table = xtab_dn) 

    # Obtain a margin table 
    marginal <- margin.table(x = crosstab,margin = tgt_margins) 

    # To extract the value, figure out which marginal cell contains 
    # all variables of interest set to success 
    # sapply() goes over all the elements of the dimname names 
    # Finds numeric index in that dimension where the name == success 
    # We subset the resulting vector by tgt_margins 
    # (to only get the cells in our marginal table) 
    # Then, use prod() to multiply them together and get the location 
    tgt_cell <- prod(sapply(X = xtab_dnn, 
                            FUN = match, 
                            x = success)[tgt_margins]) 

    # Return as named list for ease of stacking 
    return(list(count = marginal[tgt_cell])) 
} 

# Doing a call of mapply() lets us get the results 
do.call(what = rbind.data.frame, 
        args = mapply(FUN = uFunc_GetMargins, 
                      varvector = lst_combs, 
                      MoreArgs = list(crosstab = xtab_freq, 
                                      success = "1"), 
                      SIMPLIFY = FALSE, 
                      USE.NAMES = TRUE)) 
#  count 
# A  52 
# B  47 
# C  66 
# AB  24 
# AC  30 
# BC  34 
# ABC 15 

I have dropped my previous solution that used aggregate.
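
One caveat on the cell lookup in uFunc_GetMargins: the product of the per-dimension positions matches the linear index of the target cell when the success level is the last level in every selected dimension, which holds for the 0/1 data here. For other level layouts that is not guaranteed. A minimal sketch of a variant that locates the cell with a one-row index matrix instead (the name get_joint_count and the toy data are hypothetical, made up for illustration):

```r
# Hypothetical helper: same margin-table idea as above, but the target
# cell is located with matrix indexing instead of prod()
get_joint_count <- function(crosstab, varvector, success = "1") {
  dn <- dimnames(crosstab)
  # Numeric positions of the requested variables among the dimensions
  margins <- match(varvector, names(dn))
  marginal <- margin.table(crosstab, margin = margins)
  # Position of the success level within each kept dimension
  idx <- vapply(dn[margins], function(lv) match(success, lv), integer(1))
  # A one-row index matrix selects exactly one cell of the marginal table
  as.vector(marginal[matrix(idx, nrow = 1)])
}

# Toy 0/1 indicators (illustrative data, not the question's df)
toy <- data.frame(A = c(1, 1, 0, 1),
                  B = c(1, 0, 1, 1),
                  C = c(0, 1, 1, 1))
toy_tab <- xtabs(~ A + B + C, data = toy)

get_joint_count(toy_tab, c("A", "C"))  # rows where A and C are both 1
```

It can be dropped into the mapply() call in place of uFunc_GetMargins, provided the result is wrapped in a list as before.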