2014-06-17 48 views
7

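I want to subset a data frame by a factor, keeping only the factor levels that occur at least a certain number of times. Is there an elegant way to remove rare factor levels from a data frame?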

df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12)) 

This code creates the data frame:

factor variable 
1  a -1.55902013 
2  a 0.22355431 
3  a -1.52195456 
4  a -0.32842689 
5  a 0.85650212 
6  b 0.00962240 
7  b -0.06621508 
8  b -1.41347823 
9  b 0.08969098 
10  b 1.31565582 
11  c -1.26141417 
12  c -0.33364069 

I want to discard the factor levels that occur fewer than 5 times. I wrote a for loop, and it works:

df.new <- df  # start from the full data so that every rare level is removed
counts <- table(df$factor)
for (lev in names(counts)) {
    if (counts[[lev]] < 5) {
        df.new <- df.new[df.new$factor != lev, ]
    }
}

But is there a faster, nicer solution?

Answers

6

How about:

df.new <- df[!(df$factor %in% names(which(table(df$factor) < 5))), ]
0

Trying it with base functions...

lvl = as.data.frame(table(df$factor)) 
colnames(lvl) = c('factor','count') 
lvl 
    factor count 
1  a  5 
2  b  5 
3  c  2 

df[df$factor %in% lvl[lvl$count>=5,]$factor,] 
    factor variable 
1  a -0.01619026 
2  a 0.94383621 
3  a 0.82122120 
4  a 0.59390132 
5  a 0.91897737 
6  b 0.78213630 
7  b 0.07456498 
8  b -1.98935170 
9  b 0.61982575 
10  b -0.05612874 
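
The same count-then-filter idea can be sketched without the intermediate lookup table. This is a minimal sketch, not part of the original answer: the helper name `keep_frequent` and its arguments are hypothetical, and it assumes the column contains no missing values. `ave()` attaches each row's group size, and `droplevels()` removes the emptied levels when the column is a factor.

```r
# Hypothetical helper: keep rows whose value in `col` occurs at least `min_n` times.
# Assumes `col` has no NA values.
keep_frequent <- function(data, col, min_n) {
  key <- as.character(data[[col]])
  # ave() returns, for every row, the size of the group that row belongs to
  n_per_row <- ave(seq_along(key), key, FUN = length)
  # droplevels() discards factor levels that no longer occur
  droplevels(data[n_per_row >= min_n, , drop = FALSE])
}

set.seed(1)
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))
df.new <- keep_frequent(df, "factor", 5)
table(df.new$factor)
```

Wrapping the filter in a function keeps the threshold and the column name in one place if the same filter is needed for several columns.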
3

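You could filter on the per-factor counts with a join:

library(dplyr) 
# count rows per level, then keep only the levels with at least 5 rows
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5) 
df.1 <- semi_join(df, common.factors, by = "factor") 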
+1

You probably want a semi join – hadley

5
library(data.table) 
setDT(df)[, variable[.N >= 5], by = factor] 

## factor   V1 
## 1:  a -0.8204684 
## 2:  a 0.4874291 
## 3:  a 0.7383247 
## 4:  a 0.5757814 
## 5:  a -0.3053884 
## 6:  b 1.5117812 
## 7:  b 0.3898432 
## 8:  b -0.6212406 
## 9:  b -2.2146999 
## 10:  b 1.1249309 
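
The call above returns the kept values in a column named `V1`. A minimal sketch of a variant that keeps the original column names, assuming the data.table package is installed, returns `.SD` only for groups that are large enough. The names `dt` and `dt.new` are illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))

# For each group, return all of its rows (.SD) only when the group has
# at least 5 rows (.N); smaller groups return NULL and are dropped.
dt.new <- dt[, if (.N >= 5) .SD, by = factor]
dt.new
```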
+2

+1 and wow, 'data.table' continues to impress. My only criticism is that it is hard to read. – Hugh

+2

@Hugh, is it harder to read than 'dplyr'? :) –

+2

@DavidArenburg See @beginneR's 'dplyr' solution. I also find the 'dplyr' syntax easier to read than 'data.table'. – rrs

8
require(dplyr) 

df %>% group_by(factor) %>% filter(n() >= 5) 
#factor variable 
#1  a 2.0769363 
#2  a 0.6187513 
#3  a 0.2426108 
#4  a -0.4279296 
#5  a 0.2270024 
#6  b -0.6839748 
#7  b -0.3285610 
#8  b 0.2625743 
#9  b -0.9532957 
#10  b 1.4526317 
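
Note that the result of the pipeline above is still grouped by `factor`. As a sketch, assuming a reasonably recent dplyr and a column stored as a factor, the grouping and the emptied level can be removed afterwards:

```r
library(dplyr)

set.seed(1)
df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = rnorm(12))

df.new <- df %>%
  group_by(factor) %>%
  filter(n() >= 5) %>%   # n() is the number of rows in the current group
  ungroup() %>%          # return an ungrouped tibble
  droplevels()           # remove the now-empty level "c"

levels(df.new$factor)
```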
4

Until recently I would have agreed with group_by + filter. However, another solution, from the tidyverse package forcats, is:

require(forcats) 
require(dplyr) 

# fct_lump_min() lumps levels that occur fewer than `min` times into "Other"
df %>% filter(fct_lump_min(factor, min = 5) != "Other") 

We can also make it a bit more expressive by using NA for the low-frequency categories:

df %>% filter(!is.na(fct_lump_min(factor, min = 5, other_level = NA))) 