How can I remove parentheses, together with the word inside them, using the tm package?

Say I have a document that contains text like this:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

I want to remove "(API)", and this needs to be done before

corpus <- tm_map(corpus, removePunctuation)

After "(API)" is removed, the text should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..."

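I have searched for a long time, but every answer I can find only removes the parentheses themselves. The word inside them, which I do not want in the corpus either, is left behind.

I would really appreciate some hints.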

Asked 2015-10-16 by John Chou

You can use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE removes the parentheses automatically.

quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
## [8] "ingredient"     "API"            "business"       "which"

Added:

If you want to tokenise the text first, then you need to lapply a gsub over the tokens until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this will work:

# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
# a logical mask is safer than -grep(): with no match, toks[-grep(...)] would return nothing
toks[!grepl("^\\(.*\\)$", toks)]
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"
## [7] "pharmaceutical" "ingredient"     "business,which..."
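The token-filtering idea can also be tried without quanteda. The following is a minimal base-R sketch, assuming plain whitespace splitting with strsplit() as a stand-in for the "fasterword" tokeniser; the variable names and the `no_paren` vector are illustrative and not part of the original answer. It also shows why a logical mask from grepl() is a safer filter than a negative grep() index.

```r
# Sample text from the question (the trailing "..." is kept as-is).
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Whitespace tokenisation in base R (an assumed stand-in for a tokeniser package).
toks <- strsplit(txt, "\\s+")[[1]]

# Logical mask: TRUE for tokens that are entirely a parenthetical, e.g. "(API)".
is_paren <- grepl("^\\(.*\\)$", toks)
kept <- toks[!is_paren]
print(kept)

# Pitfall: with no match, grep() returns integer(0), and x[-integer(0)] selects nothing.
no_paren <- c("no", "brackets", "here")
lost <- no_paren[-grep("^\\(.*\\)$", no_paren)]   # every token is dropped
safe <- no_paren[!grepl("^\\(.*\\)$", no_paren)]  # all three tokens survive
print(length(lost))
print(length(safe))
```

Note that whitespace splitting leaves "business,which..." as a single token, so punctuation still has to be handled in a later step.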

If you just want to remove the parenthetical expression as in the question, then you do not need tm or quanteda at all:

# exactly as in the question 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt) 
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..." 

# with added punctuation 
txt2 <- "ingredient (API), business,which..." 
txt3 <- "ingredient (API). New sentence..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2) 
## [1] "ingredient, business,which..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3) 
## [1] "ingredient. New sentence..."

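The longer regular expression also catches cases where the parenthetical expression ends a sentence or is followed by additional punctuation, such as a comma.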
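One case the longer pattern does not cover is a parenthetical at the very end of the string, because the pattern requires a whitespace or punctuation character after the closing parenthesis. The sketch below compares it with a shorter variant that consumes the preceding whitespace instead. The shorter pattern and the test strings are illustrative assumptions, not part of the original answer.

```r
# Pattern from the answer: needs one character (space or punctuation) after ")".
pattern_long <- "\\s(\\(\\w*\\))(\\s|[[:punct:]])"

# Assumed alternative: drop the parenthetical plus any whitespace before it.
pattern_short <- "\\s*\\(\\w+\\)"

x <- c("ingredient (API) business",
       "ingredient (API), business",
       "ingredient (API)")            # parenthetical at the end of the string

res_long  <- gsub(pattern_long, "\\2", x)
res_short <- gsub(pattern_short, "", x)

print(res_long)   # the third element is left unchanged
print(res_short)  # the third element becomes "ingredient"
```

Both patterns only match a single run of word characters inside the parentheses, so a multi-word parenthetical would need a different character class.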

Answered 2015-10-16 15:51:28

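Thanks for your answer, but I need to remove more than just the parentheses. The word inside them has to go as well.

OK, I have modified my answer, see above.

Thanks for your help, this works too!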

If there is only a single word inside the parentheses, how about this (untested):

removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)}) 
tm_map(corpus, removeBracketed)
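For reference, here is a slightly fuller sketch of the same idea placed in front of removePunctuation, as the question requires. It assumes a one-document corpus built with VCorpus(VectorSource(...)). The helper name `strip_bracketed` is hypothetical, and its pattern also consumes the whitespace before the parenthesis, since removing only "(API)" would leave two consecutive spaces. The tm part runs only if the package is installed.

```r
# Hypothetical helper: drop a single bracketed word plus the whitespace before it.
strip_bracketed <- function(x) gsub("\\s*\\(\\w+\\)", "", x)

txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
cleaned <- strip_bracketed(txt)
print(cleaned)

# Optional tm pipeline (skipped when tm is not installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(txt))
  # Remove the parenthetical first ...
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_bracketed))
  # ... and only then strip the remaining punctuation.
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  print(corpus[[1]]$content)
}
```

If you prefer to keep the original pattern unchanged, adding tm's stripWhitespace as a later tm_map step would collapse the doubled space instead.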

Answered 2015-10-16 11:15:29 by dash2

Thank you very much! It really works!
