2014-02-19

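Convert "select all that apply" responses into binary options

I have a data frame of survey responses in which some columns are questions where participants could choose more than one answer ("select all that apply").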

> age <- c(24, 28, 44, 55, 53) 
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi") 
> ethnicity_other <- c(NA, NA, "luvale", NA, NA) 
> df <- data.frame(age, ethnicity, ethnicity_other) 

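I would like to turn these questions into binary response items, so that each response choice (here, the values in ethnicity and ethnicity_other) becomes its own column of 0s and 1s.

So far, I have written a script that separates the unique responses into a list (z):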

> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")), mode="list")) 
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list")) 
> 
> combine <- c(x, y) 
> 
> z <- NULL 
> for(i in combine){ 
> if(!is.na(i)){ 
> z <- append(z, i) 
> } 
> } 

I then create new columns from that list and fill them with NA values.

> for(elm in z){ 
> df[paste0("ethnicity_",elm)] <- NA 
> } 

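So now I have 35 additional columns that I would like to fill with ones and zeros, depending on whether the column name (or part of it, since I added the ethnicity_ prefix) can be found in the corresponding cell under ethnicity or ethnicity_other. I have taken a stab at this in a few ways, without finding a good solution.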

Answers

Here is one way to do it, using either plyr or data.table.

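# as.character() is needed when the columns are factors,
# because strsplit() only accepts character input
all_ethnicities <- unique(c(
    unlist(strsplit(as.character(df$ethnicity), " ")),
    unlist(strsplit(as.character(df$ethnicity_other), " "))
    ))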

df$id <- 1:nrow(df) 

library(plyr) 

ddply(df, .(id), function(x) 
     table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")), 
        levels = all_ethnicities))) 

##   id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0

library(data.table) 

DT <- data.table(df) 

DT[, {
    as.list(
        table(
            factor(
                unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
                levels = all_ethnicities)
        )
    )
}, by = id]

##    id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
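
For readers without plyr or data.table, the same per-row tabulation can be sketched in base R. This is only a rough sketch, not part of the answer above: it rebuilds the example data so that it runs on its own, assumes character columns, and the names other, tokens, indicators and out are made up here for illustration.

```r
# Rebuild the example data from the question (character columns assumed).
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# Combine both columns into one character vector of tokens per row.
# ifelse() swaps NA for "" so that paste() does not produce the string "NA".
other <- ifelse(is.na(df$ethnicity_other), "", df$ethnicity_other)
tokens <- strsplit(trimws(paste(df$ethnicity, other)), " ")

# Every distinct response becomes one level, in order of first appearance.
all_ethnicities <- unique(unlist(tokens))

# table(factor(...)) counts each level per row, including levels with count 0.
indicators <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = all_ethnicities))),
  integer(length(all_ethnicities))
))
colnames(indicators) <- paste0("ethnicity_", all_ethnicities)

# One row per participant, one 0/1 column per response choice.
out <- cbind(id = seq_len(nrow(df)), as.data.frame(indicators))
```

Because table() counts occurrences, a choice that appeared twice in the same row would yield 2 rather than 1; the example data contains no such repeats.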
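Wow, this is great. Thank you so much. I am a little unclear on how the ddply function works (function(x)...?), but I will tinker with it. I also tried to get every column prefixed with "ethnicity_". In my own attempt I used the paste function when creating the column names, but I have trouble seeing where the columns get created in the first solution. Thanks again!! – chrisnyoder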

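@chrisnyoder 'ddply' splits the data by the 'id' variable (in this case, just each row) and then applies the function to each piece. So the input 'x' to the function will be a one-row 'data.frame'. Try 'ddply(df, .(id), function(x) browser())' to explore the function's environment. To set the column names, the simplest solution is to do it after running (e.g. 'out <- ddply(df, ...)' then 'names(out)[names(out) != "id"] <- paste0("ethnicity_", names(out)[names(out) != "id"])'). I will add more to this answer later today. –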
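
The renaming step from the comment above can be tried on its own. In this small sketch, out is a hypothetical two-row stand-in for the result of the ddply() call, and to_rename is a helper name introduced here for illustration.

```r
# Hypothetical stand-in for the data frame returned by ddply().
out <- data.frame(id = 1:2, ngoni = c(1, 0), bemba = c(0, 1))

# Select every column except the grouping variable and prefix its name.
to_rename <- names(out) != "id"
names(out)[to_rename] <- paste0("ethnicity_", names(out)[to_rename])

names(out)
```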

Here is how I would do it:

First, you need something to store each participant's ethnicities. The way I do this is to build a list of them:

# strsplit() is vectorised and returns one character vector per row
ethnicities <- strsplit(as.character(df$ethnicity), " ")

For your specific example, we get:

> ethnicities 
[[1]] 
[1] "ngoni" 

[[2]] 
[1] "bemba" 

[[3]] 
[1] "lozi" "tonga" 

[[4]] 
[1] "bemba" "tonga" "other" 

[[5]] 
[1] "bemba" "tongi" 

and then loop over these to fill in your data.frame df:

for (i in seq_along(ethnicities)) {
    for (eth in ethnicities[[i]]) {
        # mark the column that matches this choice in row i
        df[[paste0('ethnicity_', eth)]][i] <- 1
    }
}

The resulting df should be:

> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1

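There are other ways to do this. You could also collapse the two loops into a single call, but I feel the resulting code would not be any more efficient (and it would be more complicated to read!).

Does this help?

Edit:

By the way, if you really want 0 rather than NA in your data.frame, it is as simple as changing the code that initializes the added columns: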

> for(elm in z){ 
> df[paste0("ethnicity_",elm)] <- 0 # instead of NA 
> } 
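
The nested loops can also be replaced by a membership test per choice. The following is a rough sketch and not the code from this answer: it assumes character columns, handles only the ethnicity column (as the loop above does), and yields 0 rather than NA for unselected choices. The names choices and indicators are made up here; the result is kept in a separate data frame so that a response such as "other" cannot collide with the existing ethnicity_other column.

```r
# Rebuild the example data from the question (character columns assumed).
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# One character vector of selected choices per participant.
ethnicities <- strsplit(as.character(df$ethnicity), " ")
choices <- unique(unlist(ethnicities))

# For each choice, test every participant's selections for membership.
# The outer vapply() returns a matrix with one column per choice.
indicators <- vapply(
  choices,
  function(choice) {
    vapply(ethnicities, function(sel) as.integer(choice %in% sel), integer(1))
  },
  integer(nrow(df))
)

# Keep the indicators in their own data frame, with prefixed column names.
indicators <- as.data.frame(indicators)
names(indicators) <- paste0("ethnicity_", names(indicators))
```

Since %in% only tests membership, each cell is 0 or 1 even if a participant's response repeats a choice.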

Here is an approach using concat.split.expanded from my "splitstackshape" package:

## Combine your "ethnicity" and "ethnicity_other" column 
df$ethnicity <- paste(df$ethnicity, 
         ifelse(is.na(df$ethnicity_other), "", 
          as.character(df$ethnicity_other))) 
## Drop the original "ethnicity_other" column 
df$ethnicity_other <- NULL 

## Split up the new "ethnicity" column 
library(splitstackshape) 
concat.split.expanded(df, "ethnicity", sep=" ", 
         type="character", fill=0, drop=TRUE) 
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1

The fill argument can easily be set to whatever you want. It defaults to NA, but here I have set it to 0 because I think that is what you are looking for.