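LDA with topicmodels: how can I see which topics different documents belong to?
I am using LDA from the topicmodels package. I ran it on about 30,000 documents with 30 topics and extracted the top 10 words for each topic; they look very good. Now I would like to see which documents belong to which topic with the highest probability. How can I do that?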
library(tm)
myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
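# content_transformer() keeps the corpus structure intact (needed in tm >= 0.6)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))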
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
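# strip URLs (punctuation is already gone, so "http://t.co/x" appears as "httptcox")
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))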
myStopwords <- c("twitter", "tweets", "tweet", "tweeting", "account")
# remove stopwords from corpus
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# stem words
# require(rJava) # needed for stemming function
# library(Snowball) # also needed for stemming function
# a <- tm_map(myCorpus, stemDocument, language = "english")
myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths=c(2,Inf), weighting=weightTf))
myDtm2 <- removeSparseTerms(myDtm, sparse=0.85)
dtm <- myDtm2
library(topicmodels)
# drop documents that became empty after preprocessing; LDA() rejects all-zero rows
rowTotals <- apply(dtm, 1, sum)
dtm2 <- dtm[rowTotals > 0, ]
dim(dtm2)
dtm_LDA <- LDA(dtm2, 30)
The related question of assigning new documents to an existing model has already been asked and answered here: http://stackoverflow.com/a/16120518/1036500 – Ben 2013-04-21 08:11:32
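One possible way to get at the per-document topic probabilities is sketched below. It is a minimal illustration, not necessarily the approach the asker ended up using. It assumes that `posterior()` and `topics()` from topicmodels are applied to the fitted model, and it uses the `AssociatedPress` data set that ships with topicmodels as a small stand-in for `dtm2`. The names `toy_dtm`, `toy_lda`, `doc_topics`, `best_topic`, `best_prob` and `result` are made up for the example.

```r
library(topicmodels)

# Stand-in for dtm2 (assumption): a small slice of the AssociatedPress
# document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:50, ]

# Fit a small model; in the question this step is dtm_LDA <- LDA(dtm2, 30).
toy_lda <- LDA(toy_dtm, k = 3, control = list(seed = 1))

# Document-by-topic matrix of posterior probabilities (one row per document).
doc_topics <- posterior(toy_lda)$topics

# Single most likely topic for each document, and the probability of that topic.
best_topic <- topics(toy_lda)
best_prob  <- apply(doc_topics, 1, max)

# One row per document, in the same order as the rows of the DTM.
result <- data.frame(
  doc   = seq_len(nrow(doc_topics)),
  topic = best_topic,
  prob  = best_prob
)

# Documents grouped by topic, strongest assignments first.
head(result[order(result$topic, -result$prob), ])
```

Applied to the model in the question, `posterior(dtm_LDA)$topics` would have 30 columns, and the document ids would presumably come from `docs[rowTotals > 0]`, since the rows of `dtm2` are the documents that survived the empty-row filter.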