Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

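I have an R data frame containing a factor that I want to "expand", so that for each factor level there is an associated column in a new data frame that contains a 1/0 indicator. For example, suppose I have: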

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

I want:

df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))

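Because some analyses need a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names, and if something already exists, I'd rather use that.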

Source

2011-02-19 John Horton


Use the model.matrix function:

model.matrix(~ Species - 1, data=iris)

Source

2011-02-19 03:50:11

I might add that this method is much faster than using `cast`. – 2013-12-08 15:03:25

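@RyanChase, in the 14 hours between your posting the comment and my noticing it and replying, you could have looked at the help page `?formula` and found the answer in the 2nd paragraph of the Details section. Or you could have tried the code with and without the "-1" and compared the output to see the effect. But I guess you have more patience than I do. The "-1" specifies not to fit an intercept (there are other ways as well), so indicator variables are created for every level rather than contrast-based differences. – 2015-09-26 19:52:26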

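@GregSnow I reviewed the 2nd paragraph of `?formula` and `?model.matrix`, but it wasn't clear (possibly just because I lack depth in matrix algebra and model formulation). After digging further, I've gathered that -1 simply specifies that the "intercept" column should not be included. If you leave out the -1, the output contains an intercept column of 1s and one of the binary columns is omitted. You can work out where the omitted column would be 1 from the rows where all the other columns are 0. The documentation seems cryptic - is there another good resource? – 2015-10-05 22:25:56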
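The effect of the "-1" discussed in these comments can be checked directly. The following is a minimal sketch in base R that reuses df.original from the question; the names no_intercept and with_intercept are introduced here purely for illustration.

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# Without an intercept: one indicator column per factor level.
no_intercept <- model.matrix(~ eggs - 1, data = df.original)
colnames(no_intercept)
# [1] "eggsbar" "eggsfoo"

# With the default intercept: the first level ("bar") has no column of its own
# and is represented by the rows where "eggsfoo" is 0.
with_intercept <- model.matrix(~ eggs, data = df.original)
colnames(with_intercept)
# [1] "(Intercept)" "eggsfoo"

# The omitted indicator can be recovered from the remaining one.
recovered_bar <- 1 - with_intercept[, "eggsfoo"]
```

Here recovered_bar matches the "eggsbar" column of the intercept-free matrix, which is what the second comment describes.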

Dummy variables are probably what you want. In that case, model.matrix is useful:

> with(df.original, data.frame(model.matrix(~eggs+0), ham))
  eggsbar eggsfoo ham
1       0       1   1
2       0       1   2
3       1       0   3
4       1       0   4

Source

2011-02-19 03:49:59 kohske

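If your data frame consists only of factors (or you are working on a subset of variables that are all factors), you can also use the acm.disjonctif function from the ade4 package: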

R> library(ade4)
R> df <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
  eggs.bar eggs.foo ham.blue ham.green ham.red
1        0        1        0         0       1
2        0        1        1         0       0
3        1        0        0         1       0
4        1        0        0         0       1

Not exactly the case you are describing, but it can be useful too...
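For a data frame with several factors, a similar result can be sketched in base R with model.matrix. One caveat: a plain `~ eggs + ham - 1` yields a full set of indicators only for the first factor, so the sketch below passes identity contrasts through the contrasts.arg argument. The names full_contrasts and indicators are illustrative, and stringsAsFactors = TRUE is set explicitly on the assumption that you are running R 4.0 or later, where strings are no longer converted to factors by default.

```r
df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                 ham = c("red", "blue", "green", "red"),
                 stringsAsFactors = TRUE)

# contrasts(f, contrasts = FALSE) returns an identity matrix over the levels
# of f, which asks model.matrix() for one indicator column per level.
full_contrasts <- lapply(df, contrasts, contrasts = FALSE)

# "~ . - 1" means: use every column of df, without an intercept.
indicators <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
colnames(indicators)
```

The column names use model.matrix's own convention (variable name immediately followed by the level), not the dotted names that acm.disjonctif produces.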

Source

2011-02-19 12:49:54 juba

A quick way using the reshape2 package:

require(reshape2)

> dcast(df.original, ham ~ eggs, length)

Using ham as value column: use value_var to override.
  ham bar foo
1   1   0   1
2   2   0   1
3   3   1   0
4   4   1   0

Note that this produces exactly the column names you want.

Source

2011-02-19 13:09:11

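I just came across this old thread and thought I'd add a function that uses ade4 to take a data frame consisting of factors and/or numeric data and return a data frame with the factors converted to dummy codes.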

dummy <- function(df) {
    # acm.disjonctif() comes from the ade4 package.
    require(ade4)

    # drop = FALSE keeps a data frame even when only one column matches,
    # so column names survive and acm.disjonctif() always gets a data frame.
    num <- df[, sapply(df, is.numeric), drop = FALSE]
    fac <- df[, sapply(df, is.factor), drop = FALSE]

    # Numeric columns first, followed by one 0/1 column per factor level.
    DF <- data.frame(num, acm.disjonctif(fac))
    return(DF)
}

Let's try it.

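# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default.
df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                 ham = c("red","blue","green","red"), x = rnorm(4),
                 stringsAsFactors = TRUE)
dummy(df)

df2 <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                  ham = c("red","blue","green","red"),
                  stringsAsFactors = TRUE)
dummy(df2)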

Source

2011-10-30 04:38:40

From the nnet package:

library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
  bar foo ham
1   0   1   1
2   0   1   2
3   1   0   3
4   1   0   4

Source

2013-02-19 05:04:09 mnel

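I needed a function to "explode" factors that is a bit more flexible, and wrote one based on the acm.disjonctif function from the ade4 package. It lets you choose the exploded values, which are fixed at 0 and 1 in acm.disjonctif. It only explodes factors that have "few" levels. Numeric columns are preserved.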

# Function to explode factors that are considered to be categorical, 
# i.e., they do not have too many levels. 
# - data: The data.frame in which categorical variables will be exploded. 
# - values: The exploded values for the value being unequal and equal to a level. 
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors. 
# Inspired by the acm.disjonctif function in the ade4 package. 
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
    exploders <- colnames(data)[sapply(data, function(col) {
        is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
    })]
    if (length(exploders) > 0) {
        exploded <- lapply(exploders, function(exp) {
            col <- data[, exp]
            n <- length(col)
            dummies <- matrix(values[1], n, length(levels(col)))
            dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
            colnames(dummies) <- paste(exp, levels(col), sep = '_')
            dummies
        })
        # Only keep numeric data.
        data <- data[sapply(data, is.numeric)]
        # Add exploded values.
        data <- cbind(data, exploded)
    }
    return(data)
}
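The core of this function is the linear-index assignment into the dummies matrix. As a minimal sketch of that one step in isolation, using 0 and 1 instead of the function's defaults and a small illustrative factor (col, n and dummies mirror the names inside the function):

```r
col <- factor(c("foo", "foo", "bar", "bar"))
n <- length(col)

# Start with an all-zero matrix: one row per observation, one column per level.
dummies <- matrix(0, n, nlevels(col), dimnames = list(NULL, levels(col)))

# R stores matrices column by column, so the cell in row i and column j sits
# at linear position i + n * (j - 1). as.integer(col) gives each row's level
# number, which selects the column to switch on.
dummies[seq_len(n) + n * (as.integer(col) - 1)] <- 1
dummies
```

Each row ends up with exactly one 1, in the column of that row's level. This avoids looping over rows or levels.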

Source

2015-06-22 09:57:24 rakensi

Here is a clearer way to do it. I use model.matrix to create the dummy boolean variables and then merge them back into the original data frame.

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4)) 
df.original 
# eggs ham 
# 1 foo 1 
# 2 foo 2 
# 3 bar 3 
# 4 bar 4 

# Create the dummy boolean variables using the model.matrix() function. 
mm <- model.matrix(~eggs-1, df.original)
mm
# eggsbar eggsfoo 
# 1  0  1 
# 2  0  1 
# 3  1  0 
# 4  1  0 
# attr(,"assign") 
# [1] 1 1 
# attr(,"contrasts") 
# attr(,"contrasts")$eggs 
# [1] "contr.treatment" 

# Remove the "eggs" prefix from the column names as the OP desired. 
colnames(mm) <- gsub("eggs","",colnames(mm)) 
mm 
# bar foo 
# 1 0 1 
# 2 0 1 
# 3 1 0 
# 4 1 0 
# attr(,"assign") 
# [1] 1 1 
# attr(,"contrasts") 
# attr(,"contrasts")$eggs 
# [1] "contr.treatment" 

# Combine the matrix back with the original dataframe. 
result <- cbind(df.original, mm) 
result 
# eggs ham bar foo 
# 1 foo 1 0 1 
# 2 foo 2 0 1 
# 3 bar 3 1 0 
# 4 bar 4 1 0 

# At this point, you can select out the columns that you want.
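One possible way to finish that last step and arrive at the df.desired layout from the question is sketched below. It is self-contained, so it rebuilds mm first. The anchored pattern "^eggs" is a small variation on the gsub call above, chosen here so that a level name that itself contains "eggs" would not be altered.

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# One indicator column per level, then strip only the leading "eggs" prefix.
mm <- model.matrix(~ eggs - 1, df.original)
colnames(mm) <- sub("^eggs", "", colnames(mm))

# Combine the indicators with the numeric column and put the columns
# in the order used by df.desired in the question.
result <- cbind(as.data.frame(mm), df.original["ham"])
result <- result[, c("foo", "bar", "ham")]
result
```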

Source

2016-05-21 01:08:08 stackoverflowuser2010
