2016-04-06
# Generate counts table (the diamonds data set ships with ggplot2)
library(plyr)
library(ggplot2)
example <- data.frame(count(diamonds, c('color', 'cut')))
example[1:3,]

# Excerpt of table
  color       cut freq
1     D      Fair  163
2     D      Good  662
3     D Very Good 1513

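You can easily filter the table to rows with freq > 1000 using example[example$freq > 1000,]. I would like to generate a similar table, except that all rows below some threshold (for example 1000) are combined into a single (Other) row, much like what happens when there are too many factor levels and you call summary(example, maxsum=3). In short: how can low-frequency factor levels be grouped into an 'Other' level in R?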

     color         cut          freq
 D      : 5   Fair   : 7   Min.   : 119
 E      : 5   Good   : 7   1st Qu.: 592
 (Other):25   (Other):21   Median :1204
                           Mean   :1541
                           3rd Qu.:2334
                           Max.   :4884

Example of the ideal output:

Ideally, I would like to take this, example[example$color=='J',]:

color       cut freq
    J      Fair  119
    J      Good  307
    J Very Good  678
    J   Premium  808
    J     Ideal  896

and produce this:

color       cut freq
    J Very Good  678
    J   Premium  808
    J     Ideal  896
    J   (Other)  426

Bonus: it would also be great if this kind of grouping could be applied to a ggplot chart like the one produced by the call below.

ggplot(example, aes(x=color, y=freq)) + geom_bar(aes(fill=cut), stat = "identity") 


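Have a look at this similar question: http://stackoverflow.com/questions/23730067/creating-an-other-field

So what is your threshold for "Other"? – mtoto

@mtoto I don't mind what the number is, but I would like the frequencies of the 'cut' levels that fall below the threshold to be grouped within each color. Perhaps a threshold of freq < 500? – amblina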

Answers

1

Try this:

library(plyr) 
library(ggplot2) 
example <- data.frame(count(diamonds,c('color', 'cut'))) 


# Compute the row id where frequency is lower than some threshold 
idx <- example$freq < 1000 

# Create a helper function that adds the level "Other" to a vector 
add_other_level <- function(x){ 
    levels(x) <- c(levels(x), "Other") 
    x 
} 

# Change the factor levels for the rows below the threshold
example <- within(example, 
     { 
     color <- add_other_level(color) 
     color[idx] <- "Other" 
     cut <- add_other_level(cut) 
     cut[idx] <- "Other" 
     } 
) 

# Create a plot 
ggplot(example, aes(x = color, y = freq, fill = cut)) + 
    geom_bar(stat = "identity") 

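For the table form asked for in the question, where only cut is collapsed and the small counts are summed within each color, a base R sketch might look like the following. The helper name collapse_rare and its arguments are hypothetical (they come from neither answer), and the data frame is just the color == "J" rows from the question typed in by hand so the block runs without ggplot2.

```r
# The color == "J" rows from the question, entered by hand
j_rows <- data.frame(
  color = factor(rep("J", 5)),
  cut   = factor(c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")),
  freq  = c(119, 307, 678, 808, 896)
)

# Hypothetical helper: relabel rows below `threshold` as `other`,
# then sum freq within each color/cut combination
collapse_rare <- function(df, threshold = 500, other = "(Other)") {
  cut_chr <- as.character(df$cut)
  cut_chr[df$freq < threshold] <- other
  # Keep the original level order and put the new level last
  df$cut <- factor(cut_chr, levels = c(levels(df$cut), other))
  out <- aggregate(freq ~ color + cut, data = df, FUN = sum)
  out <- out[order(out$color, out$cut), ]
  rownames(out) <- NULL
  out
}

result <- collapse_rare(j_rows)
print(result)
```

With the 500 threshold suggested in the comments, the Fair and Good rows (119 + 307) should end up in a single (Other) row of 426, as in the ideal output. Applying the same helper to the full example data frame would collapse cut within every color while leaving color itself untouched.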

3

Here is an alternative that uses dplyr to pipe the regrouped data directly into the ggplot call.

library(dplyr) 
example %>% mutate(cut = ifelse(freq < 500, "Other", as.character(cut))) %>% 
    group_by(color, cut) %>% 
    summarise(freq = sum(freq)) %>% 
    ggplot(aes(color, freq, fill = cut)) + 
    geom_bar(stat = "identity") 


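Be sure to detach plyr first. plyr also exports mutate and summarise, and if its versions mask the dplyr ones, the grouped sum will not be computed per group and the output will be wrong.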
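If detaching plyr is inconvenient, one way to sidestep the masking problem is to name the dplyr functions explicitly with the dplyr:: prefix. The sketch below assumes dplyr is installed and again uses the hand-typed color == "J" rows from the question rather than the full table. The names j_rows and collapsed are illustrative only.

```r
library(dplyr)

# The color == "J" rows from the question, entered by hand
j_rows <- data.frame(
  color = factor(rep("J", 5)),
  cut   = factor(c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")),
  freq  = c(119, 307, 678, 808, 896)
)

# Explicit dplyr:: prefixes make the pipeline independent of whether
# plyr is attached and in which order the packages were loaded
collapsed <- j_rows %>%
  dplyr::mutate(cut = ifelse(freq < 500, "(Other)", as.character(cut))) %>%
  dplyr::group_by(color, cut) %>%
  dplyr::summarise(freq = sum(freq)) %>%
  dplyr::ungroup() %>%
  as.data.frame()

print(collapsed)
```

Because cut becomes a character column here, the rows come back in alphabetical order of cut. Convert it back to a factor with the desired level order if the ordering matters for the plot legend.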