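How to remove error in term-document matrix in R?
I am trying to create a term-document matrix in R from a corpus of documents. But when I run the code, I get this error, followed by 2 warnings: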
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus -> simple_triplet_matrix -> .Call
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
My code is as follows:
library(tm)
library(RWeka)
library(tmcn.word2vec)
#Reading data
data <- read.csv("Train.csv", header=T)
#text <- data$EventDescription
#Pre-processing
corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content")))
#Reading dictionary file
dict <- scan("dictionary.txt", what='character',sep='\n')
#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm_doc <- DocumentTermMatrix(corpus,control=list(stopwords = dict, tokenize=BigramTokenizer))
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))
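As suggested in other answers on SO, I tried installing the SnowballC package and the other ideas listed there. I still get the same error. Can anyone help me with this? Thanks in advance.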
Please post enough of the input file for someone to reproduce the error. – pcantalupo
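For example, post the output of `dput(head(data))`. But first try it yourself and see whether the error still occurs when you use only the `head` of `data`. –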
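A minimal sketch of what that reproducibility check could look like. The data frame below is a made-up stand-in for `Train.csv` (its contents are an assumption, not the asker's data); only `head()` and `dput()` come from the comment above.

```r
# Hypothetical stand-in for: data <- read.csv("Train.csv", header = TRUE)
data <- data.frame(
  EventDescription = c(
    "Pump failure reported at site A",
    "Valve leak detected near pump",
    "Routine inspection completed",
    "Power outage in control room",
    "Sensor calibration overdue",
    "Pump restarted after maintenance",
    "Operator reported unusual noise",
    "Backup generator test passed"
  ),
  stringsAsFactors = FALSE
)

# Keep only the first six rows, as head() does by default
small <- head(data)

# Print a copy-pasteable R expression that recreates `small`;
# this is the text to post in the question
dput(small)
```

Running the rest of the pipeline on `small$EventDescription` instead of `data$EventDescription` then shows whether the error depends on the full input or already appears on a handful of rows.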
Looks like a parallel-processing issue. Check this [post](http://stackoverflow.com/questions/25069798/r-tm-in-mclapplycontentx-fun-all-scheduled-cores-encountered-errors) or this [post](http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka). – phiver
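A hedged sketch of the kind of workaround commonly suggested for this `mclapply` failure, not a confirmed fix for the asker's data. It assumes a tm version in which `termFreq` is dispatched through `mclapply` (as in the traceback above), and it makes two changes: forcing single-core processing with `options(mc.cores = 1)`, and wrapping `tolower` in `content_transformer()` so the corpus elements stay tm documents (which makes the separate `PlainTextDocument` step unnecessary). The `docs` vector and the name `NGramTok` are made up for illustration.

```r
library(tm)
library(RWeka)

# Hypothetical stand-in for data$EventDescription
docs <- c("Pump failure reported at site A.",
          "Valve leak detected near pump!")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
# Base functions such as tolower are wrapped so each element remains a tm document
corpus <- tm_map(corpus, content_transformer(tolower))

# Run on a single core so the RWeka tokenizer is not called from forked workers
options(mc.cores = 1)

# 1- to 4-gram tokenizer, same Weka_control settings as in the question
NGramTok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = NGramTok))
inspect(dtm)
```

If the error disappears with these two changes, the `stopwords = dict` and `dictionary = dict` controls from the question can be added back one at a time to see whether either of them reintroduces it.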