2015-06-11

LDA with the tm package in R using bigrams

I have a CSV in which each row is a document. I need to run LDA on it. I have the following code:

library(tm) 
library(SnowballC) 
library(topicmodels) 
library(RWeka) 

X = read.csv('doc.csv',sep=",",quote="\"",stringsAsFactors=FALSE) 

corpus <- Corpus(VectorSource(X)) 
corpus <- tm_map(tm_map(tm_map(corpus, stripWhitespace), tolower), stemDocument) 
corpus <- tm_map(corpus, PlainTextDocument) 
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
dtm <- DocumentTermMatrix(corpus, control = list(tokenize=BigramTokenizer,weighting=weightTfIdf)) 

Inspecting the DTM object at this point gives:

<<DocumentTermMatrix (documents: 52, terms: 477)>> 
Non-/sparse entries: 492/24312 
Sparsity   : 98% 
Maximal term length: 20 
Weighting   : term frequency - inverse document frequency (normalized) (tf-idf) 

I then drop the empty documents and fit the model:

rowTotals <- apply(dtm , 1, sum) 
dtm.new <- dtm[rowTotals> 0, ] 
g = LDA(dtm.new,10,method = 'VEM',control=NULL,model=NULL) 

Running LDA produces the following error:

Error in LDA(dtm.new, 10, method = "VEM", control = NULL, model = NULL) : 
    The DocumentTermMatrix needs to have a term frequency weighting 

The document-term matrix is clearly weighted. What exactly am I doing wrong?

Please help.

Yes, dtm.new is still a DocumentTermMatrix object. – dulla

Answer

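The document-term matrix needs to have a term frequency weighting. LDA models raw term counts, so a matrix weighted with weightTfIdf is rejected; use weightTf instead: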

DocumentTermMatrix(corpus,
                   control = list(tokenize = BigramTokenizer,
                                  weighting = weightTf))
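
For reference, here is a minimal end-to-end sketch of the corrected pipeline. It rests on several assumptions that are not in the question: the `docs` vector is hypothetical toy data standing in for the rows of doc.csv, `k = 2` and the seed are arbitrary, and the bigram tokenizer is built from `ngrams()` and `words()` in the NLP package (the approach suggested in the tm FAQ) rather than RWeka's `NGramTokenizer`, so that the example runs without Java. It also uses `VCorpus`, since custom tokenizers may be ignored for other corpus types in recent tm versions.

```r
library(tm)           # also attaches NLP, which provides ngrams() and words()
library(topicmodels)

# Hypothetical toy documents standing in for the rows of doc.csv.
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are common pets",
          "the stock market fell sharply today",
          "investors sold stock as the market fell")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))

# Bigram tokenizer without RWeka: join each pair of adjacent words with a space.
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

# weightTf keeps raw counts, which is what LDA() expects.
dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = BigramTokenizer,
                                         weighting = weightTf))

# The weighting attribute is what LDA() inspects before fitting.
print(attr(dtm, "weighting"))

# Drop documents with no terms; slam::row_sums works on the sparse matrix
# directly instead of converting it to a dense one as apply() would.
rowTotals <- slam::row_sums(dtm)
dtm.new <- dtm[rowTotals > 0, ]

# Fit the model; k and the seed are arbitrary illustrative values.
fit <- LDA(dtm.new, k = 2, method = "VEM", control = list(seed = 1))

print(terms(fit, 3))   # top 3 bigrams per topic
print(topics(fit))     # most likely topic for each document
```

If tf-idf is still wanted, one option is to use it only to select which terms to keep, and then pass the count-weighted matrix restricted to those terms to `LDA()`.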