2017-06-12

I am trying to reduce the number of levels of each factor variable in my data. I want to reduce the number of levels with two operations:

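  1. If the number of levels is larger than a cut-off, replace the least frequent levels with a new level until the number of levels has reached the cut-off.
  2. Replace any level of a factor that does not have enough observations with the new level.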
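
The two rules can be illustrated on a single factor with base R. This is only a sketch: the vector `x`, the thresholds `max_levels` and `min_obs`, and the level name "REMAIN" are assumptions picked for illustration.

```r
# Hypothetical example vector and thresholds, chosen only for illustration
x <- factor(c(rep("a", 5), "b", "b", "c", "d"))
max_levels <- 3
min_obs <- 3

# Rule 1: x has 4 levels, more than the cut-off of 3, so keep the
# (max_levels - 1) most frequent levels and move the rest to "REMAIN"
counts <- sort(table(x), decreasing = TRUE)
keep <- names(counts)[seq_len(max_levels - 1)]
x1 <- factor(ifelse(x %in% keep, as.character(x), "REMAIN"))

# Rule 2: every level with fewer than min_obs observations also
# goes to "REMAIN"
rare <- names(which(table(x1) < min_obs))
x2 <- factor(ifelse(x1 %in% rare, "REMAIN", as.character(x1)))

table(x2)
```

After rule 1 the levels "c" and "d" are merged into "REMAIN". Rule 2 then also moves "b", which has only two observations, leaving "a" and "REMAIN".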

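I wrote a function that works fine, but I don't like the code. It does not matter if the REMAIN level itself does not have enough observations. I would prefer a dplyr approach.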

library(data.table) 

# data must be a data.table; it is modified by reference via := 
ReplaceFactor <- function(data, max_levels, min_values_factor){ 
    # First make sure that there are not too many levels in a factor 
    for(i in colnames(data)){ 
     if(is.factor(data[[i]])){ 
      if(nlevels(data[[i]]) > max_levels){ 
       # Keep the (max_levels - 1) most frequent levels; the rest become "REMAIN" 
       levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1 : (max_levels - 1)] 
       data[!get(i) %in% levels_keep, (i) := "REMAIN"] 
       # Rebuild the factor to drop the now-unused levels 
       data[[i]] <- as.factor(as.character(data[[i]])) 
      } 
     } 
    } 
    # Now make sure that each level has enough observations 
    for(i in colnames(data)){ 
     if(is.factor(data[[i]])){ 
      if(min(table(data[[i]])) < min_values_factor){ 
       levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor] 
       data[get(i) %in% names(levels_replace), (i) := "REMAIN"] 
       data[[i]] <- as.factor(as.character(data[[i]])) 
      } 
     } 
    } 
    return(data) 
} 
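# stringsAsFactors = TRUE is needed on R >= 4.0, where strings are no longer 
# converted to factors by default 
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"), 
       B = 1:9, 
       C = c("A","A","B","B","C","C","C","D","D"), 
       D = c("A","B","E", "E", "E","E","E", "E", "E"), 
       stringsAsFactors = TRUE) 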
str(df) 
'data.frame': 9 obs. of 4 variables: 
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3 
$ B: int 1 2 3 4 5 6 7 8 9 
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4 
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3 

dt2 <- ReplaceFactor(data = data.table(df), 
       max_levels = 3, 
       min_values_factor = 2) 
str(dt2) 
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables: 
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3 
$ B: int 1 2 3 4 5 6 7 8 9 
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3 
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1 
- attr(*, ".internal.selfref")=<externalptr> 
dt2 
   A B      C      D
1: A 1      A REMAIN
2: A 2      A REMAIN
3: B 3 REMAIN      E
4: B 4 REMAIN      E
5: C 5      C      E
6: C 6      C      E
7: C 7      C      E
8: C 8 REMAIN      E
9: C 9 REMAIN      E

I suggest you take a look at the 'forcats' package, which has nice functions for this kind of task, e.g. http://forcats.tidyverse.org/reference/

'fct_lump' might be helpful.

Answers

Using forcats:

library(dplyr) 
library(forcats) 

max_levels <- 3 
min_values_factor <- 2 
df %>% 
    mutate_if(is.factor, fct_lump, n = max_levels, 
      other_level = "REMAIN", ties.method = "first") %>% 
    mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1)/nrow(.), 
      other_level = "REMAIN") 

#   A B      C      D
# 1 A 1      A REMAIN
# 2 A 2      A REMAIN
# 3 B 3      B      E
# 4 B 4      B      E
# 5 C 5      C      E
# 6 C 6      C      E
# 7 C 7      C      E
# 8 C 8 REMAIN      E
# 9 C 9 REMAIN      E

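(Hm, I wasn't able to replicate the exact behaviour of your function, but you can probably get what you want by adjusting ties.method and subtracting 1 from max_levels.)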
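
If the exact behaviour of the original function matters, one option is to skip forcats and wrap the two rules in a small helper that dplyr applies to every factor column. The following is a minimal sketch, not a forcats or dplyr feature: `collapse_levels` and its argument names are hypothetical, and it assumes dplyr 1.0 or later for `across()` and `where()`.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or forcats): applies both rules
# from the question to a single factor
collapse_levels <- function(f, max_levels, min_obs, other = "REMAIN") {
  x <- as.character(f)
  # Rule 1: too many levels, so keep the (max_levels - 1) most frequent
  # ones and send everything else to `other`
  if (nlevels(f) > max_levels) {
    counts <- sort(table(f), decreasing = TRUE)
    keep <- names(counts)[seq_len(max_levels - 1)]
    x[!x %in% keep] <- other
  }
  # Rule 2: levels with fewer than min_obs observations also go to `other`
  counts <- table(x)
  rare <- names(counts)[as.vector(counts) < min_obs]
  x[x %in% rare] <- other
  factor(x)
}

# Example data from the question
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

res <- df %>%
  mutate(across(where(is.factor),
                ~ collapse_levels(.x, max_levels = 3, min_obs = 2)))
res
```

On the example data this should reproduce the columns of `dt2` from the question: in C the levels B and D become REMAIN, and in D the levels A and B become REMAIN. Because the helper returns a plain data frame column, it works without data.table.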
