2015-10-16 127 views

How can I remove a word in parentheses with the tm package?

Say I have a document that contains a passage like this:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..." 

I want to remove "(API)", and it has to be done before

corpus <- tm_map(corpus, removePunctuation)

After "(API)" is removed, the text should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..." 

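I have searched for a long time, but all I could find were answers about removing only the parentheses. I don't want the word inside them to appear in the corpus either.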

I would really appreciate some hints.

Answers

1

You could use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE removes the parentheses automatically.

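# txt holds the sentence from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
# tokenize()/removePunct is the quanteda API from the time of this answer;
# current releases use tokens(txt, remove_punct = TRUE)
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
## [8] "ingredient"     "API"            "business"       "which"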
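
Note that the output still contains "API": removing punctuation deletes the parentheses but keeps the word between them. The following is a minimal base R sketch of that effect, using gsub() with [[:punct:]] as a stand-in for a punctuation-removal step; it is an illustration, not what quanteda or tm do internally.

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Removing punctuation alone deletes "(" and ")" but keeps the word inside
no_punct <- gsub("[[:punct:]]", "", txt)
no_punct

# The word from the parentheses is still part of the text
grepl("API", no_punct, fixed = TRUE)
```

This is why the parenthetical has to be removed before any punctuation is stripped.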

Added:

If you want to tokenise the text first, you will need an lapply with gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this will work:

# tokenized version 
require(quanteda) 
toks <- tokenize(txt, what = "fasterword", simplify = TRUE) 
toks[!grepl("^\\(.*\\)$", toks)]  # grepl() also works when no token matches
## [1] "Other"    "segment"   "comprised"   "of"    "our"    "active"   
## [7] "pharmaceutical" "ingredient"  "business,which..." 
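
The same token filter can be sketched in base R, without quanteda. Splitting on whitespace with strsplit() is an assumption here and only approximates what = "fasterword". The sketch also shows why logical indexing with grepl() is the safer choice: negative indexing with grep() returns an empty vector when no token matches.

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Split on runs of whitespace (an approximation of a whitespace tokeniser)
toks <- strsplit(txt, "\\s+")[[1]]

# Keep every token that is not fully wrapped in parentheses
keep <- !grepl("^\\(.*\\)$", toks)
toks[keep]

# Edge case: no token is parenthesised
no_parens <- c("plain", "tokens")
dropped_all <- no_parens[-grep("^\\(.*\\)$", no_parens)]   # empty: grep() found nothing
kept_all    <- no_parens[!grepl("^\\(.*\\)$", no_parens)]  # both tokens survive
```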

If you just want to remove the parenthetical expression as in the question, you don't need either tm or quanteda:

# exactly as in the question 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt) 
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..." 

# with added punctuation 
txt2 <- "ingredient (API), business,which..." 
txt3 <- "ingredient (API). New sentence..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2) 
## [1] "ingredient, business,which..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3) 
## [1] "ingredient. New sentence..." 

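The longer regular expression also catches cases where the parenthetical expression ends a sentence or is followed by additional punctuation, such as a comma.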
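
Because the pattern uses \\w*, it only matches a single word inside the parentheses. If the text may contain multi-word parentheticals, a broader pattern could look like the sketch below. The helper name strip_parenthetical and the string txt4 are hypothetical, and the pattern assumes the parentheses are not nested.

```r
# Hypothetical helper: drop any "(...)" group together with the
# whitespace in front of it
strip_parenthetical <- function(x) {
  # \\s*         optional whitespace before the opening parenthesis
  # \\([^()]*\\) a pair of parentheses with no parentheses inside
  gsub("\\s*\\([^()]*\\)", "", x)
}

txt  <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
txt2 <- "ingredient (API), business,which..."
txt4 <- "active pharmaceutical ingredient (the API) business"

strip_parenthetical(txt)
strip_parenthetical(txt2)
strip_parenthetical(txt4)
```

A parenthetical at the very start of a string would leave a leading space behind, so this is not a complete cleaner.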

Thanks for your answer, but I need to remove more than just the parentheses. The word inside them needs to be removed too.

OK, I have revised my answer; see above.

Thanks for your help, this works too!

1

If it is only single words, how about this (untested):

removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)}) 
tm_map(corpus, removeBracketed) 
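
One detail worth checking: the pattern removes "(API)" but not the space in front of it, so two spaces remain. A minimal sketch of a variant follows; the name remove_bracketed_word is hypothetical, and it is written as a plain function so that it runs without tm. The tm calls in the trailing comments show assumed usage only.

```r
# Variant of the pattern that also removes the whitespace before "("
remove_bracketed_word <- function(x, ...) gsub("\\s*\\(\\w+\\)", "", x)

x <- "active pharmaceutical ingredient (API) business,which..."

with_gap    <- gsub("\\(\\w+\\)", "", x)   # pattern from the answer: two spaces remain
without_gap <- remove_bracketed_word(x)    # single space

# With tm loaded, the order of the steps matters (assumed usage, not run here):
# corpus <- tm_map(corpus, content_transformer(remove_bracketed_word))
# corpus <- tm_map(corpus, removePunctuation)
# Alternatively, keep the original pattern and run tm's stripWhitespace afterwards.
```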

Thank you very much! It really works!
