2017-04-24

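One-hot encoding that creates n-1 dummy variables

To one-hot encode the factor variables in my dataset, I am using the great function by user "Ben" from this post: How to one-hot-encode factor variables with data.table?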

one_hot <- function(dt, cols="auto", dropCols=TRUE, dropUnusedLevels=FALSE){ 
    # One-Hot-Encode unordered factors in a data.table 
    # If cols = "auto", each unordered factor column in dt will be encoded. (Or specify a vector of column names to encode) 
    # If dropCols=TRUE, the original factor columns are dropped 
    # If dropUnusedLevels = TRUE, unused factor levels are dropped 

    # Automatically get the unordered factor columns 
    if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) & !is.ordered(x)))] 

    # Build tempDT containing an ID column and the 'cols' columns 
    tempDT <- dt[, cols, with=FALSE] 
    tempDT[, ID := .I] 
    setcolorder(tempDT, unique(c("ID", colnames(tempDT)))) 
    for(col in cols) set(tempDT, j=col, value=factor(paste(col, tempDT[[col]], sep="_"), levels=paste(col, levels(tempDT[[col]]), sep="_"))) 

    # One-hot-encode 
    if(dropUnusedLevels == TRUE){ 
        newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length) 
    } else{ 
        newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length) 
    } 

    # Combine binarized columns with the original dataset 
    result <- cbind(dt, newCols[, !"ID"]) 

    # If dropCols = TRUE, remove the original factor columns 
    if(dropCols == TRUE){ 
        result <- result[, !cols, with=FALSE] 
    } 

    return(result) 
} 

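This function creates n dummy variables for the n levels of each factor column. However, because I want to use the data for modeling, I only need n-1 dummy variables per factor column. Is that possible, and if so, how can I do it with this function?

From my point of view, this is the line that has to be adjusted: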

newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value,  drop = T, fun = length) 

Here is the input table...

   ID color   size
1:  1 black  large
2:  2 green medium
3:  3   red  small

library(data.table) 
DT = setDT(structure(list(ID = 1:3, color = c("black", "green", "red"), 
    size = c("large", "medium", "small")), .Names = c("ID", "color", 
"size"), row.names = c(NA, -3L), class = "data.frame")) 

...and the desired output table:

ID color.black color.green size.large size.medium
 1           1           0          1           0
 2           0           1          0           1
 3           0           0          0           0

The 'dummyVars' function in the 'caret' package does exactly this. Is there a reason not to use it? – Jealie

Answer


Here is a solution that performs a full-rank encoding (i.e., it creates n-1 columns per factor to avoid collinearity):

library(caret) 
data.table(ID = DT$ID, predict(dummyVars(ID ~ ., DT, fullRank = TRUE), DT)) 

This does exactly the job:

   ID colorgreen colorred sizemedium sizesmall
1:  1          0        0          0         0
2:  2          1        0          1         0
3:  3          0        1          0         1

For all available options, see this friendly walkthrough of the function, and ?dummyVars.
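
If adding caret as a dependency is not desirable, base R's model.matrix() can give a comparable result. The following is only a minimal sketch, assuming R's default treatment contrasts are in effect; the input table is rebuilt here so that the snippet is self-contained, and the names mm and result are illustrative. Like the dummyVars call above, it drops the first level of each variable (black, large), not the last one as in the desired output of the question.

```r
library(data.table)

# Same example data as in the question, rebuilt so the snippet runs on its own
DT <- data.table(ID    = 1:3,
                 color = c("black", "green", "red"),
                 size  = c("large", "medium", "small"))

# model.matrix() treats character columns as factors and, under the default
# treatment contrasts, leaves out the first level of each variable.
# The first column of the result is the intercept, which is removed here.
mm <- model.matrix(~ color + size, data = DT)[, -1, drop = FALSE]

# Put the ID column back in front of the n-1 dummy columns
result <- data.table(ID = DT$ID, as.data.table(mm))
print(result)
```

Which level serves as the reference is determined by the level order of each factor, so re-ordering the levels (for example with relevel()) before encoding changes which column is omitted.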


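Also: in the comments, the OP mentioned that this operation would need to be performed on millions of rows and thousands of columns, which justifies the need for data.table. If this simple preprocessing step is too much for the available "computing muscle", then I am afraid the modeling step (a.k.a. the real deal) is doomed to fail.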
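
For readers who nevertheless want to stay within data.table, as the question asks, one possible approach is to build the indicator columns directly and simply skip one reference level per variable. This is a rough sketch, not a modification of the one_hot function from the question: the helper name one_hot_drop_last and its sep argument are hypothetical, and dropping the last level is an assumption chosen to match the desired output table in the question.

```r
library(data.table)

# Hypothetical helper (not part of the original post): creates n-1 indicator
# columns per variable by leaving out the last level as the reference category.
one_hot_drop_last <- function(dt, cols, sep = ".") {
  new_cols <- list()
  for (col in cols) {
    x  <- as.factor(dt[[col]])      # also accepts character columns
    lv <- levels(x)
    keep <- lv[-length(lv)]         # every level except the last one
    for (l in keep) {
      # 1 where the row has this level, 0 otherwise (NA stays NA)
      new_cols[[paste(col, l, sep = sep)]] <- as.integer(x == l)
    }
  }
  # Drop the original columns and append the indicator columns
  cbind(dt[, !cols, with = FALSE], as.data.table(new_cols))
}

DT <- data.table(ID    = 1:3,
                 color = c("black", "green", "red"),
                 size  = c("large", "medium", "small"))

encoded <- one_hot_drop_last(DT, cols = c("color", "size"))
print(encoded)
```

The sketch loops over levels in plain R, so its cost grows with the total number of levels; whether it is fast enough for millions of rows and thousands of columns would have to be measured on the real data.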