Eliminating factor levels that have little impact


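A column has hundreds of levels, and not all of them really add value: roughly 60% of the levels account for 80% (they do not appear many times in the data frame) and are not expected to affect the result. The goal is to eliminate the levels that do not contribute to the 80%. Can anyone help? Thanks in advance.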

2017-08-30 emeralddove

It's fine to downvote, but could you please add a comment so that I can make changes? Thanks. – emeralddove

What do you mean by "contribute"? Are you using a statistical test? Which one? Can you give a reproducible example? –

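First, you need a reasonable way to decide which levels to exclude. You can base this on a statistical test or on popularity (levels that each have very few rows). Then you should consider whether to eliminate them (which also eliminates their entire rows) or to recode them into another level (e.g. "Rest"). – AntoniosK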
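To make the two options in the comment above concrete, here is a minimal base R sketch of the popularity-based approach. The data frame `dt`, the cut-off `min_rows` and the label "Rest" are illustrative assumptions, not something taken from the question.

```r
# Hypothetical example data; `dt`, `min_rows` and "Rest" are illustrative names
dt <- data.frame(type = c("A", "A", "A", "B", "B", "B", "c", "D"),
                 value = 1:8, stringsAsFactors = FALSE)

min_rows <- 2  # assumed cut-off: levels with fewer rows than this count as rare

# Rows per level, then the names of the levels below the cut-off
level_counts <- table(dt$type)
rare_levels <- names(level_counts)[level_counts < min_rows]

# Option 1: eliminate the rare levels (their rows are dropped as well)
dt_dropped <- dt[!dt$type %in% rare_levels, ]

# Option 2: keep every row and recode the rare levels into a single level
dt_recoded <- dt
dt_recoded$type[dt_recoded$type %in% rare_levels] <- "Rest"
```

Option 1 loses the `value` information of the dropped rows, while option 2 keeps all rows and only coarsens `type`, so the choice depends on what the downstream analysis needs.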

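Here is a simple process that finds the values lying beyond the first 80% of the dataset (rows), counted cumulatively, and groups them together under a new value. This process uses a character column rather than a factor column.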

library(dplyr) 

# example dataset 
dt = data.frame(type = c("A","A","A","B","B","B","c","D"), 
       value = 1:8, stringsAsFactors = FALSE) 

dt 

#   type value
# 1    A     1
# 2    A     2
# 3    A     3
# 4    B     4
# 5    B     5
# 6    B     6
# 7    c     7
# 8    D     8

# count number of rows for each type 
dt %>% count(type) 

# # A tibble: 4 x 2
#    type     n
#   <chr> <int>
# 1     A     3
# 2     B     3
# 3     c     1
# 4     D     1

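# add cumulative percentages 
# (sort = TRUE orders types from most to least frequent, so CumPrc 
# accumulates by frequency rather than alphabetically) 
dt %>% 
    count(type, sort = TRUE) %>% 
    mutate(Prc = n/sum(n), 
     CumPrc = cumsum(Prc)) 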

# # A tibble: 4 x 4
#    type     n   Prc CumPrc
#   <chr> <int> <dbl>  <dbl>
# 1     A     3 0.375  0.375
# 2     B     3 0.375  0.750
# 3     c     1 0.125  0.875
# 4     D     1 0.125  1.000

# pick the types you want to group together 
# (those whose cumulative share, by descending frequency, exceeds 80%) 
types_to_group <- dt %>% 
    count(type, sort = TRUE) %>% 
    mutate(Prc = n/sum(n), 
     CumPrc = cumsum(Prc)) %>% 
    filter(CumPrc > 0.80) %>% 
    pull(type) 

# group them 
dt %>% mutate(type_upd = ifelse(type %in% types_to_group, "Rest", type)) 

#   type value type_upd
# 1    A     1        A
# 2    A     2        A
# 3    A     3        A
# 4    B     4        B
# 5    B     5        B
# 6    B     6        B
# 7    c     7     Rest
# 8    D     8     Rest
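
Since the process above works on a character column, here is a rough base R sketch of how the same idea might look when `type` is stored as a factor. The names `dt_f`, `cum_shares` and `levels_to_group` are made up for illustration, and the 0.80 cut-off is simply carried over from the code above.

```r
# Hypothetical data with `type` stored as a factor this time
dt_f <- data.frame(type = factor(c("A", "A", "A", "B", "B", "B", "c", "D")),
                   value = 1:8)

# Share of rows per level, ordered from most to least frequent,
# then accumulated in that order
shares <- sort(table(dt_f$type), decreasing = TRUE) / nrow(dt_f)
cum_shares <- cumsum(shares)

# Levels whose cumulative share passes 0.80 (same cut-off as above)
levels_to_group <- names(cum_shares)[cum_shares > 0.80]

# Giving several levels the same name merges them into one level
dt_f$type_upd <- dt_f$type
levels(dt_f$type_upd)[levels(dt_f$type_upd) %in% levels_to_group] <- "Rest"
```

Renaming levels this way keeps `type_upd` a factor and avoids the character conversion that `ifelse()` would otherwise perform on a factor column.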


2017-08-30 16:21:20 AntoniosK

Thanks AntoniosK - there certainly is a cognitive load in understanding this question :) – emeralddove
