I found the answer by looking at the tm library's source code for the removeWords function and extending its regular expression from
gsub(sprintf("(*UCP)\\b(%s)\\b",
to
gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
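To see what the extra [a-zA-Z]* on each side changes, here is a small sketch in base R only, so it runs without tm. The `banned` vector and the sample `text` are made up for illustration; the two patterns are the ones quoted above.

```r
# Made-up sample data, for illustration only.
banned <- c("bad", "spam")
text <- "a badly written spammy post is bad"

# Build the alternation the same way removeWords does.
alternation <- paste(sort(banned, decreasing = TRUE), collapse = "|")

# Original removeWords pattern: matches the banned words only as whole words.
exact <- gsub(sprintf("(*UCP)\\b(%s)\\b", alternation),
              "", text, perl = TRUE)

# Extended pattern: matches any run of letters that contains a banned word.
containing <- gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b", alternation),
                   "", text, perl = TRUE)

print(exact)
print(containing)
```

With the original pattern, "badly" and "spammy" survive because the banned word is not followed by a word boundary; with the extended pattern both are removed along with the standalone "bad".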
The complete function:
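library(tm)

# Generic: remove every word that contains any of the given substrings.
removeWordsContaining <-
  function(x, words)
    UseMethod("removeWordsContaining", x)

# Character method: same as tm's removeWords, except that [a-zA-Z]* on either
# side of the alternation widens each match to the whole surrounding word.
removeWordsContaining.character <-
  function(x, words)
    gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
                 paste(sort(words, decreasing = TRUE), collapse = "|")),
         "", x, perl = TRUE)

# PlainTextDocument method, so the function can be passed to tm_map().
removeWordsContaining.PlainTextDocument <-
  content_transformer(removeWordsContaining.character)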
# vs and bannedWords are assumed to be defined already (they are not shown here).
blog_corpus <- Corpus(vs, readerControl = list(language = "en"))
blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))
blog_corpus <- tm_map(blog_corpus, stripWhitespace)
blog_corpus <- tm_map(blog_corpus, removePunctuation)
blog_corpus <- tm_map(blog_corpus, removeNumbers)
blog_corpus <- tm_map(blog_corpus, removeWords, stopwords("english"))
blog_corpus <- tm_map(blog_corpus, removeWordsContaining, bannedWords$V1)
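One thing to watch for: each match is replaced with an empty string, so the spaces on either side of a removed word stay behind. Because the pipeline runs stripWhitespace before the removal steps, running it once more at the end may be worthwhile. The sketch below shows the effect in base R only; `remove_containing`, `docs` and the banned words are hypothetical stand-ins, not part of tm.

```r
# Base R stand-in for the character method above; no tm required.
remove_containing <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up documents and banned substrings.
docs <- c("this spammy post is badly written", "a clean post")
cleaned <- remove_containing(docs, c("bad", "spam"))

# Collapse runs of whitespace and trim the ends, roughly what a final
# stripWhitespace pass followed by trimws() would do.
tidy <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
print(tidy)
```

In the tm pipeline itself, the equivalent would presumably be one more tm_map(blog_corpus, stripWhitespace) call as the last step.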
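Not sure about 'tm', but if you're OK with involving an additional package, 'quanteda' has a function 'selectFeatures' (and the related 'removeFeatures') that allows regular expressions and glob-style wildcards. See '?quanteda::selectFeatures' for examples. – Jota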
Seconding quanteda. It's much more straightforward than tm. It will soon be the standard for text processing. – lmkirvan