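Identifying values that occur a specific number of times in an R data frame

I have a data frame of strings, most of which are duplicates. I want to identify the values in this data frame that occur at least x times.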

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

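The data frame contains hundreds of unique strings and tens of thousands of elements. In this example, how can I identify which strings occur at least three times? Specifically, I want to output the strings themselves that meet this criterion, not their indices in the data frame.

Source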

2015-06-21 Tanner

Perhaps table is what you need. Here is a modified example based on your code:

> df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
> df
    x
1 str
2 str
3 str
4 ing
5 ing
6   .
> table(df$x)

  . ing str 
  1   2   3 
> table(df$x) > 2

    .   ing   str 
FALSE FALSE  TRUE 
> names(which(table(df$x) > 2))
[1] "str"
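Since the question stores the threshold in occurs, the same idea can be wrapped in a small function. This is only a sketch: values_at_least is a name made up for illustration, and it compares with >= occurs rather than the hard-coded > 2 above.

```r
# Hypothetical helper (not part of the original answer): return the values of
# a vector that occur at least `n` times.
values_at_least <- function(x, n) {
  counts <- table(x)            # frequency of each distinct value
  names(counts)[counts >= n]    # keep the names whose count reaches n
}

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

result <- values_at_least(df$x, occurs)
print(result)
# [1] "str"
```

The result is a plain character vector of the strings themselves, and it is empty (character(0)) when nothing reaches the threshold.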

Source

2015-06-21 10:15:18

You can also use count:

library(dplyr) 
df %>% count(x)

This calls n() to count the number of observations for each value of x:

# Source: local data frame [3 x 2]
#
#     x n
# 1   . 1
# 2 ing 2
# 3 str 3

If you only want the values that occur at least 3 times, use filter():

df %>% count(x) %>% filter(n >= 3)

Which gives:

# Source: local data frame [1 x 2]
#
#     x n
# 1 str 3

Finally, if you just want to extract the factor levels that meet your filter condition:

df %>% count(x) %>% filter(n >= 3) %>% .$x 

# [1] str 
# Levels: . ing str
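The output above is a factor that still carries all three levels. If plain strings are wanted, as the question asks, one option is to finish the pipeline with pull() and as.character(). This is a sketch assuming a dplyr version that provides pull() (0.7.0 or later); frequent is just an illustrative name.

```r
library(dplyr)

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

frequent <- df %>%
  count(x) %>%                # one row per distinct x, with its count in n
  filter(n >= occurs) %>%     # keep values seen at least `occurs` times
  pull(x) %>%                 # extract the x column as a vector
  as.character()              # drop factor levels if x happens to be a factor

print(frequent)
# [1] "str"
```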

As suggested by @David S in the comments, you can also use data.table:

library(data.table) 
setDT(df)[, if(.N >= 3) x, by = x]$V1

Or

setDT(df)[, .N, by = x][, x[N >= 3]] 

# [1] str 
# Levels: . ing str

As suggested by @Frank, you can also use tabulate, the "workhorse" behind table:

levels(df[[1]])[tabulate(df[[1]])>=3] 

# [1] "str"
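One caveat: this one-liner relies on the column being a factor. Since R 4.0.0, data.frame() keeps strings as character by default, so levels() would return NULL there. A sketch that works either way is to convert explicitly first; passing nbins is my addition, so that the counts always line up with the levels even when the last level is unused.

```r
df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

# Convert explicitly so the code does not depend on stringsAsFactors.
f <- factor(df[[1]])

# tabulate() counts the integer codes; nbins = nlevels(f) guarantees one
# count per level, in the same order as levels(f).
counts <- tabulate(f, nbins = nlevels(f))

frequent <- levels(f)[counts >= occurs]
print(frequent)
# [1] "str"
```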

Benchmark

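library(data.table)
library(dplyr)
library(microbenchmark)

# stringsAsFactors = TRUE keeps x a factor (the default before R 4.0), which base2 relies on
df <- data.frame(x = sample(LETTERS[1:26], 10e6, replace = TRUE), stringsAsFactors = TRUE)
df2 <- copy(df)  # separate copy, so setDT() does not convert df itself

mbm <- microbenchmark(
    base = names(which(table(df$x) >= 385000)),
    base2 = levels(df[[1]])[tabulate(df[[1]]) >= 385000L],
    dplyr = count(df, x) %>% filter(n >= 385000) %>% .$x,
    DT1 = setDT(df2)[, if(.N >= 385000) x, by = x]$V1,
    DT2 = setDT(df2)[, .N, by = x][, x[N >= 385000]],
    times = 50
)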

> mbm
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval cld
#   base 495.44936 523.29186 545.08199 543.56660 551.90360 652.13492    50   d
#  base2  20.08123  20.09819  20.11988  20.10633  20.14137  20.20876    50   a
#  dplyr 226.75800 227.27992 231.19709 228.36296 232.71308 259.20770    50   c
#    DT1  41.03576  41.28474  50.92456  48.40740  48.66626 168.53733    50   b
#    DT2  41.45874  41.85510  50.76797  48.93944  49.49339  74.58234    50   b

Source

2015-06-21 13:14:28

What about 'library(data.table); setDT(df)[, if (.N >= occurs) x, by = x]$V1'? Or 'setDT(df)[, .N, by = x][, x[N >= occurs]]' (not sure which one is better)

That should be really fast. Let me add it to the benchmark.

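When you add it, don't run it on the same data set. Create 'df2 <- copy(df)' and then run the 'data.table' benchmarks on 'df2'. Otherwise, on the first iteration 'setDT' will convert 'df' to a 'data.table' for all the other functions as well.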
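To illustrate the point about setDT: it converts its argument by reference, so no assignment is needed and the original object itself changes class. A small sketch on the toy data from the question (the variable names are only for illustration):

```r
library(data.table)

df  <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
df2 <- copy(df)        # independent copy for the data.table code
occurs <- 3

setDT(df2)             # converts df2 in place; df is left untouched

df_is_dt  <- is.data.table(df)
df2_is_dt <- is.data.table(df2)
print(c(df = df_is_dt, df2 = df2_is_dt))
#    df   df2 
# FALSE  TRUE 

# Count rows per value of x (.N), then keep the values reaching the threshold.
frequent <- as.character(df2[, .N, by = x][N >= occurs, x])
print(frequent)
# [1] "str"
```

Had setDT(df) been called instead, every later expression using df would be working on a data.table, which is what would skew a benchmark.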
