Subset/filter based on a frequency table

I have a data frame with some text data, for example:

words <- data.frame(terms = c("qhick brown fox", 
           "tom dick harry", 
           "cats dgs", 
           "qhick black fox"))

I have been able to subset it to any rows that contain a spelling mistake:

library(qdap) 
words[check_spelling(words$terms)$row,,drop=F]

But since I have a lot of text data, I only want to filter on the misspellings that occur more frequently:

> sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T) 
qhick 
    2

So now I know that "qhick" is a common misspelling.

How can I subset words based on this table, so that only the rows containing "qhick" are returned?

Source

2017-06-30 Doug Fir

The word itself is the name attached to the result of your sort() call. If you only have one name, you can do this:

top_misspelled <- sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T) 

words[grepl(names(top_misspelled), words$terms), , drop = F] 
#   terms 
#1 qhick brown fox 
#4 qhick black fox
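
To see why names() is the right accessor here, the sketch below reproduces the table step in base R. The `misspelled` vector is a hypothetical stand-in for the output of which_misspelled(), so that the example runs without qdap. It suggests that which() returns the position of each entry in the table, not its count; the 2 printed in the question happens to equal the count only by coincidence.

```r
# Hypothetical stand-in for the words which_misspelled() would return
misspelled <- c("qhick", "dgs", "qhick", "qhick")

# Frequency table, ordered alphabetically: dgs = 1, qhick = 3
tab <- table(misspelled)

# which() gives positions within the table, named by word
idx <- which(tab > 1)

# Indexing the table itself keeps the actual counts
counts <- sort(tab[tab > 1], decreasing = TRUE)

names(idx)   # the misspelled words that occur more than once
idx          # position in the table (2 here), not the frequency
counts       # the frequency (3 here)
```

Either object works for the subsetting step, since only the names are used, but `counts` is the one to keep if the frequencies themselves are needed later.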

But if you have several, you can collapse them into a single pattern for the grepl lookup, e.g.:

words[grepl(paste0(names(top_misspelled), collapse = "|"), words$terms), ,drop = F]
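
One caveat with a plain alternation pattern is that it also matches inside longer words. A minimal sketch of one way around that, assuming the terms contain no regex metacharacters, is to wrap the alternation in `\\b` word boundaries. The `targets` vector and the sample rows are made up for illustration and stand in for names(top_misspelled) and the real data.

```r
# Illustrative data; "cat" appears inside "catastrophic" and "cats"
words <- data.frame(terms = c("catastrophic failure", "the cat sat", "cats dgs"),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for names(top_misspelled)
targets <- c("cat", "dgs")

# \b(...)\b restricts each alternative to whole-word matches
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

whole_word_hits <- grepl(pattern, words$terms, perl = TRUE)

# Keeps "the cat sat" and "cats dgs" (via "dgs"), drops "catastrophic failure"
words[whole_word_hits, , drop = FALSE]
```

If the terms could contain characters such as `.` or `(`, they would need escaping first, which this sketch does not attempt.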

A non-regex option would be to split each row into words and then return the row if any word in it matches one of the strings you are interested in:

words[sapply(strsplit(as.character(words[,"terms"]), split=" "), function(x) any(x %in% names(top_misspelled))), 
     ,drop = F] 

#   terms 
#1 qhick brown fox 
#4 qhick black fox

Source

2017-06-30 02:15:22

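Thanks for the answer, and sorry for not accepting it yet. I'd like to leave the question open for a while, because a regex can behave unexpectedly when the string is part of a larger word, e.g. "cat" in "catastrophic". –

No problem. Another idea is to use 'strsplit' to split each row and then 'sapply' to check whether any element in that row matches –

Thanks for doing that! I was wondering whether there is a "dplyr-esque" way of doing this, because although I think I can follow the non-regex approach myself, it is tricky to read. Anyway, thanks again –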
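
On the "dplyr-esque" question, one possible shape is sketched below. It assumes dplyr is installed and moves the strsplit logic into a small helper, `has_any_term()`, which is a hypothetical name introduced here and not part of dplyr or qdap. `top_terms` stands in for names(top_misspelled).

```r
library(dplyr)

words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry",
                              "cats dgs",
                              "qhick black fox"),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for names(top_misspelled)
top_terms <- "qhick"

# Hypothetical helper: TRUE for each string that contains at least one
# of the target words as a whole space-separated token
has_any_term <- function(x, targets) {
  tokens <- strsplit(as.character(x), " ", fixed = TRUE)
  vapply(tokens, function(w) any(w %in% targets), logical(1))
}

# Reads as a single filter step; no regex involved
flagged <- words %>%
  filter(has_any_term(terms, top_terms))

flagged
```

The matching is the same as in the sapply version; the helper only gives the step a name so the pipeline is easier to read.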
