You could use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE will remove the parentheses automatically.
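For reference, txt is the example sentence from the question; it is not reproduced in this answer, but a plausible reconstruction from the outputs shown below would be:
# assumed input, reconstructed from the outputs below (not part of the original answer)
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."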
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" ## "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Addendum:
If you want to tokenise the text first, then you would need an lapply around the gsub, until we add a regular-expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this will work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
If you simply want to remove the parenthetical expressions, as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches the cases where the parenthetical expression ends a sentence or is followed by punctuation such as a comma.
Thanks for your answer, but I need to remove more than just the parentheses; the word inside them needs to be removed as well. –
OK, I have revised my answer, see above. –
Thanks for your help, this works too! –