How do I fix an error when building a term-document matrix in R?

I am trying to create a term-document matrix in R from a corpus of documents. When I run the code, I get this error, followed by 2 warnings:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 
'i, j' invalid 
Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus -> simple_triplet_matrix -> .Call 
In addition: Warning messages: 
1: In mclapply(unname(content(x)), termFreq, control) : 
scheduled core 1 encountered error in user code, all values of the job will be affected 
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 
NAs introduced by coercion

My code is as follows:

library(tm) 
library(RWeka) 
library(tmcn.word2vec) 

#Reading data 
data <- read.csv("Train.csv", header=T) 
#text <- data$EventDescription 

#Pre-processing 
corpus <- Corpus(VectorSource(data$EventDescription)) 
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removePunctuation) 
corpus <- tm_map(corpus, tolower) 
corpus <- tm_map(corpus, PlainTextDocument) 
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content"))) 

#Reading dictionary file 
dict <- scan("dictionary.txt", what='character',sep='\n') 

#Bigram Tokenization 
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4)) 
tdm_doc <- DocumentTermMatrix(corpus,control=list(stopwords = dict, tokenize=BigramTokenizer)) 
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))

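As suggested in other answers on SO, I tried installing the SnowballC package and the other ideas listed there. I still get the same error. Can anyone help me with this? Thanks in advance.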

2015-09-11 Athira

Please post enough of the input file for someone to reproduce the error. – pcantalupo

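For example, post the output of 'dput(head(data))'. But first, try it yourself and see whether the error still occurs when you use only the 'head' of 'data'. –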
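A minimal sketch of what that request looks like in practice. The data frame below is a hypothetical stand-in for the asker's Train.csv; only the column name EventDescription comes from the question.

```r
# Hypothetical stand-in for Train.csv; the real file is not shown in the question
data <- data.frame(
  EventDescription = c("Server down in building A", "Network outage reported"),
  stringsAsFactors = FALSE
)

# Print a copy-pasteable R expression that recreates the first rows
dput(head(data))

# Keep only the first rows, to check whether the error still occurs on them
small <- head(data)
```

Running the rest of the pipeline on small instead of data shows whether the failure depends on particular rows.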

This looks like a parallel-processing issue. Check this [post](http://stackoverflow.com/questions/25069798/r-tm-in-mclapplycontentx-fun-all-scheduled-cores-encountered-errors) or this [post](http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka). – phiver
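One workaround commonly tried for this kind of parallel issue is to force single-core processing before building the matrix. The warning in the question comes from mclapply(), whose default number of cores is read from the mc.cores option, so a sketch under that assumption would be:

```r
# Assumption: the installed tm version counts terms through mclapply(),
# as the warning message in the question indicates.
options(mc.cores = 1)

# Confirm the setting before calling DocumentTermMatrix()
getOption("mc.cores")
```

Whether this removes the error depends on the tm and RWeka versions in use, so treat it as something to try rather than a guaranteed fix.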

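I had a similar error when cleaning my corpus. To solve it, I added the following after the offending line of code, and that fixed it. Some tm_map functions do not return a corpus...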

corpus <- Corpus(VectorSource(corpus))

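For me, the problem appeared after stemming. I would suggest trying to create a tdm after each tm_map call. That will tell you which cleaning step is causing the problem.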
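The step-by-step check described above can be sketched as follows. The sample text and the names steps and ok are hypothetical, and the bigram tokenizer from the question is left out so the sketch does not depend on RWeka. Which steps fail, if any, depends on the installed tm version.

```r
library(tm)

# Hypothetical sample text standing in for data$EventDescription
texts <- c("Server down in building A.", "Network   outage reported, again!")
corpus <- VCorpus(VectorSource(texts))

# The cleaning steps from the question, in the same order
steps <- list(
  stripWhitespace   = stripWhitespace,
  removePunctuation = removePunctuation,
  tolower           = tolower,
  PlainTextDocument = PlainTextDocument
)

# Apply one step at a time and try to build a matrix after each one
ok <- logical(0)
for (name in names(steps)) {
  ok[name] <- tryCatch({
    corpus <- tm_map(corpus, steps[[name]])
    DocumentTermMatrix(corpus)
    TRUE
  }, error = function(e) {
    message("Failed after ", name, ": ", conditionMessage(e))
    FALSE
  })
}
print(ok)
```

Any step reported as FALSE is a candidate for the call that leaves the corpus in a state DocumentTermMatrix cannot handle.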

Good luck!

2016-03-22 18:00:50 emilliman5

I tried diagnosing tm_map the way you describe, and it turned out that this call was causing my problem: corpus <- tm_map(corpus, PlainTextDocument) – lbcommer

I had the same problem building my DocumentTermMatrix, and I solved it by removing the following command:

corpus <- tm_map(corpus, PlainTextDocument)
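A minimal sketch of a cleaning pipeline without that call. It assumes a tm version (0.6 or later) in which plain functions such as tolower should be wrapped in content_transformer() so that each element stays a text document. The sample text is hypothetical, and the bigram tokenizer from the question is omitted to avoid the RWeka dependency.

```r
library(tm)

# Hypothetical sample text standing in for data$EventDescription
texts <- c("Server down in building A.", "Network   outage reported, again!")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
# Wrapping tolower keeps each element a text document, not a bare string
corpus <- tm_map(corpus, content_transformer(tolower))
# No tm_map(corpus, PlainTextDocument) step here

dtm <- DocumentTermMatrix(corpus)
dim(dtm)
```

If this builds cleanly, the tokenize and dictionary entries from the question's control list can be added back one at a time.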

2017-04-22 09:48:44
