R: document-term matrix with bigram tokenizer not working

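I am trying to build two document-term matrices for one corpus, one using unigrams and one using bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I don't know why. Here is the code: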

library(tm)
library(RWeka)  # provides NGramTokenizer and Weka_control

docs <- Corpus(DirSource("data", recursive = TRUE))

# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)

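I also tried using ngram(x, n = 2) from the ngram package as the tokenizer, but that does not work either. How can I fix the bigram tokenization?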

Source: 2017-03-05, filaments

I have this problem too, so if you find an answer, please let me know. –

Sorry for the late reply, but I got this working by using VCorpus instead of Corpus. – filaments

The tokenizer option does not seem to be applied to a Corpus (SimpleCorpus). Using VCorpus instead solves the problem.
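A minimal sketch of that fix, assuming tm 0.7 or later. As far as I can tell from the tm documentation for TermDocumentMatrix, a SimpleCorpus is always tokenized with a built-in tokenizer and takes no custom functions as control options, which would explain the identical matrices. The sample texts and the names `texts`, `simple_docs` and `docs` are made up for illustration, and the tokenizer here is built on `ngrams()` from the NLP package (a stand-in for the RWeka one, so the example runs without Java). The `BigramTokenizer` from the question should be usable in the same place.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical in-memory documents standing in for the "data" directory
texts <- c("the quick brown fox", "the lazy dog sleeps")

# Corpus() may return a SimpleCorpus here, depending on the tm version
simple_docs <- Corpus(VectorSource(texts))
print(class(simple_docs))

# VCorpus() is requested explicitly so that a custom tokenizer is used
docs <- VCorpus(VectorSource(texts))

# Stand-in bigram tokenizer: split into words, then join adjacent pairs
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

dtm_unigram <- DocumentTermMatrix(docs)
dtm_bigram <- DocumentTermMatrix(docs,
                                 control = list(tokenize = BigramTokenizer))

inspect(dtm_bigram)
```

Note that with a custom tokenizer, options such as stopwords and stemming are applied to whole tokens after tokenization, so stop words inside a bigram are not removed. If that matters, it may be simpler to clean the corpus with tm_map() before building the matrix.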

Source: 2017-03-28 18:30:48, filaments

Why 'VCorpus' over 'Corpus'? There is another related SO question [here](https://stackoverflow.com/questions/42757183/creating-n-grams-with-tm-rweka-works-with-vcorpus-but-not-corpus), but it does not seem to have a satisfactory explanation either. – hongsy
