How can I remove crazy characters like 002？？？？？？「？？？？？」 from text in R?

Here is what I am doing with my corpus:

CorpusX = tm_map(CorpusX, content_transformer(tolower)) 
CorpusX = tm_map(CorpusX, removeWords, c("X", stopwords("english"))) 
CorpusX = tm_map(CorpusX, removePunctuation) 
CorpusX = tm_map(CorpusX, stripWhitespace) 
CorpusX = tm_map(CorpusX, removeNumbers) 
CorpusX = tm_map(CorpusX, stemDocument) 

CorpusX = tm_map(CorpusX, PlainTextDocument)

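After this I build a document-term matrix and then a word cloud. If I follow this pipeline without trying to remove the characters mentioned in the question, everything works. But when I try to remove those characters, I get the following error: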

>Error in UseMethod("TermDocumentMatrix", x) : no applicable method 
> for 'TermDocumentMatrix' applied to an object of class 
> "c('DocumentTermMatrix', 'simple_triplet_matrix')"

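I am looking for an effective way to handle such characters.

PS: I completely rewrote the problem description because people were confused (my fault). Thanks for your help!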

Source

2016-05-19 Sunny

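The error message seems to be saying something else, namely that your object has the wrong class... – Frank

Actually I tried both (DTM and TDM), but the error is the same. I think I need to change the problem description. – Sunny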

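@Frank seems to be onto something: the error says that TermDocumentMatrix was applied to an object that is already of class DocumentTermMatrix, so you appear to be passing the wrong kind of object to it. The tm package has functions that convert text into either a TermDocumentMatrix or a DocumentTermMatrix.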
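As a minimal sketch of the point about classes, the snippet below builds both matrix types directly from a corpus. The two sample documents are made up for illustration, and it assumes the tm package is installed.

```r
library(tm)

# Hypothetical two-document corpus, used only for illustration.
docs <- c("the quick brown fox", "the lazy dog sleeps")
corpus <- VCorpus(VectorSource(docs))

# Both constructors take the corpus itself as input.
tdm <- TermDocumentMatrix(corpus)  # terms in rows, documents in columns
dtm <- DocumentTermMatrix(corpus)  # documents in rows, terms in columns

# Passing dtm to TermDocumentMatrix() instead of the corpus is the kind of
# call that produces the "no applicable method" error quoted in the question.
print(class(tdm))
print(class(dtm))
```

If the error appears only after adding a cleaning step, checking class() of the object right before the failing call may show whether a matrix, rather than the corpus, is being passed in.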

To get to your actual question: R is generally not great at handling Unicode. I often use Python for these problems, but the link seems to have some solutions.
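One common base R approach, shown here only as a sketch and not necessarily what the linked page recommends, is to convert the text to ASCII with iconv() and drop whatever cannot be converted. The input strings are invented for illustration, and the exact handling of unconvertible characters can vary by platform.

```r
# Hypothetical input containing non-ASCII characters (accent, curly quotes).
x <- c("caf\u00e9 \u201cquoted\u201d text", "plain ascii")

# With sub = "", bytes that cannot be represented in ASCII are replaced
# by the empty string, i.e. removed.
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_only)
```

This assumes the input really is UTF-8; if the text was read with a different encoding, the from argument would need to match it.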

Source

2016-05-19 16:16:02 HoHo

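When processing the text, are you removing any non-English characters?

If not, here is an example of how to do it. Here we remove digits, punctuation, and non-English characters.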

# Remove every run of characters that is neither a letter nor whitespace
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]+", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct), lazy = TRUE)
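One caveat: `[:alpha:]` is locale-dependent, so in a UTF-8 locale it can also match accented or other non-English letters. A stricter variant, sketched below with a hypothetical helper name and a made-up sample string, keeps only the ASCII letters A-Z and a-z plus whitespace.

```r
# Hypothetical helper (not from the original answer): keep only ASCII
# letters and whitespace. perl = TRUE makes the A-Z and a-z ranges
# match by code point rather than by locale collation.
removeNonAsciiLetters <- function(x) {
  gsub("[^A-Za-z[:space:]]+", "", x, perl = TRUE)
}

# Made-up sample with accents, digits, punctuation and curly quotes.
x <- "R\u00e9sum\u00e9 2016: 100% \u201ccrazy\u201d text!"
cleaned <- removeNonAsciiLetters(x)
print(cleaned)
```

The helper can be passed to tm_map through content_transformer() in the same way as removeNumPunct, and stripWhitespace can then collapse the runs of spaces left behind.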

Source

2016-05-19 20:56:38

You should format your code samples for readability – ozren1983
