2013-02-14 39 views
15

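LDA with topicmodels: how can I see which topics different documents belong to?

I'm using LDA from the topicmodels package. I ran it on about 30,000 documents, obtained 30 topics, and looked at the top 10 words for each topic; they look very good. But I'd like to see which documents belong to which topic with the highest probability. How can I do that?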

library(tm)

myCorpus <- Corpus(VectorSource(userbios$bio)) 
docs <- userbios$twitter_id 
# plain character functions must be wrapped in content_transformer() (tm >= 0.6)
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 
myCorpus <- tm_map(myCorpus, removePunctuation) 
myCorpus <- tm_map(myCorpus, removeNumbers) 
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 
myCorpus <- tm_map(myCorpus, content_transformer(removeURL)) 
myStopwords <- c("twitter", "tweets", "tweet", "tweeting", "account") 

# remove stopwords from corpus 
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english')) 
myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 


# stem words 
# require(rJava) # needed for stemming function 
# library(Snowball) # also needed for stemming function 
# a <- tm_map(myCorpus, stemDocument, language = "english") 

myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths=c(2,Inf), weighting=weightTf)) 
myDtm2 <- removeSparseTerms(myDtm, sparse=0.85) 
dtm <- myDtm2 

library(topicmodels) 

rowTotals <- apply(dtm, 1, sum) 
# drop empty documents; the trailing comma selects rows and keeps all columns
dtm2 <- dtm[rowTotals > 0, ] 
dim(dtm2) 
dtm_LDA <- LDA(dtm2, 30) 

The question of assigning new documents to an existing model has already been asked and answered here: http://stackoverflow.com/a/16120518/1036500 – Ben 2013-04-21 08:11:32

Answers

20

How about using a built-in dataset? This will show you which documents belong to which topic with the highest probability.

library(topicmodels) 
data("AssociatedPress", package = "topicmodels") 

k <- 5 # set number of topics 
# generate model 
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k) 
# now we have a topic model with 20 docs and five topics 

# make a data frame with topics as cols, docs as rows and 
# cell values as posterior topic distribution for each document 
gammaDF <- as.data.frame(lda@gamma) 
names(gammaDF) <- c(1:k) 
# inspect... 
gammaDF 
       1   2   3   4   5 
1 8.979807e-05 8.979807e-05 9.996408e-01 8.979807e-05 8.979807e-05 
2 8.714836e-05 8.714836e-05 8.714836e-05 8.714836e-05 9.996514e-01 
3 9.261396e-05 9.996295e-01 9.261396e-05 9.261396e-05 9.261396e-05 
4 9.995437e-01 1.140774e-04 1.140774e-04 1.140774e-04 1.140774e-04 
5 3.573528e-04 3.573528e-04 9.985706e-01 3.573528e-04 3.573528e-04 
6 5.610659e-05 5.610659e-05 5.610659e-05 5.610659e-05 9.997756e-01 
7 9.994345e-01 1.413820e-04 1.413820e-04 1.413820e-04 1.413820e-04 
8 4.286702e-04 4.286702e-04 4.286702e-04 9.982853e-01 4.286702e-04 
9 3.319338e-03 3.319338e-03 9.867226e-01 3.319338e-03 3.319338e-03 
10 2.034781e-04 2.034781e-04 9.991861e-01 2.034781e-04 2.034781e-04 
11 4.810342e-04 9.980759e-01 4.810342e-04 4.810342e-04 4.810342e-04 
12 2.651256e-04 9.989395e-01 2.651256e-04 2.651256e-04 2.651256e-04 
13 1.430945e-04 1.430945e-04 1.430945e-04 9.994276e-01 1.430945e-04 
14 8.402940e-04 8.402940e-04 8.402940e-04 9.966388e-01 8.402940e-04 
15 8.404830e-05 9.996638e-01 8.404830e-05 8.404830e-05 8.404830e-05 
16 1.903630e-04 9.992385e-01 1.903630e-04 1.903630e-04 1.903630e-04 
17 1.297372e-04 1.297372e-04 9.994811e-01 1.297372e-04 1.297372e-04 
18 6.906241e-05 6.906241e-05 6.906241e-05 9.997238e-01 6.906241e-05 
19 1.242780e-04 1.242780e-04 1.242780e-04 1.242780e-04 9.995029e-01 
20 9.997361e-01 6.597684e-05 6.597684e-05 6.597684e-05 6.597684e-05 


# Now for each doc, find just the top-ranked topic 
toptopics <- as.data.frame(cbind(document = row.names(gammaDF), 
    topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))]))) 
# inspect... 
toptopics 
     document topic 
1   1  3 
2   2  5 
3   3  2 
4   4  1 
5   5  3 
6   6  5 
7   7  1 
8   8  4 
9   9  3 
10  10  3 
11  11  2 
12  12  2 
13  13  4 
14  14  4 
15  15  2 
16  16  2 
17  17  3 
18  18  4 
19  19  5 
20  20  1 

Is that what you want to do?

Hat tip to this answer: https://stat.ethz.ch/pipermail/r-help/2010-August/247706.html
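
One caveat with `which(x == max(x))`: if a document has two topics tied for the maximum, it returns more than one index and `apply()` then yields a list rather than a vector. A minimal base-R sketch of a tie-safe alternative, using a made-up toy matrix in place of `lda@gamma` (the values and the `doc1`..`doc3` names are purely illustrative):

```r
# Toy document-topic matrix: rows are documents, columns are topics.
# These numbers are invented for illustration; in practice use lda@gamma.
gamma <- matrix(c(0.10, 0.70, 0.20,
                  0.80, 0.10, 0.10,
                  0.25, 0.25, 0.50),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3), 1:3))

# max.col() returns exactly one column index per row;
# ties.method = "first" makes the choice deterministic when values tie.
top_topic <- max.col(gamma, ties.method = "first")

# Look up the probability of the winning topic for each document.
top_prob <- gamma[cbind(seq_len(nrow(gamma)), top_topic)]

toptopics <- data.frame(document    = rownames(gamma),
                        topic       = top_topic,
                        probability = top_prob)
toptopics
```

Keeping the winning probability next to the topic makes it easy to spot documents whose assignment is weak.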


Thanks for your answer, but I was hoping to find the most likely topic for new documents, not for the ones I ran LDA on. – d12n 2013-02-15 11:27:10


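I just noticed that my question was very poorly worded, and I'll edit it to make it clearer. As written, it looks as if I was asking exactly what you answered. – d12n 2013-02-15 12:02:19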


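Perhaps it's best to leave your question as it is, accept this answer, and then ask your real question as a new question. If you just edit this one, it probably won't get any more attention than it already has. – Ben 2013-02-15 17:19:00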
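
On the follow-up about new documents: topicmodels has a `posterior()` method that accepts `newdata`. A rough sketch, assuming the new documents are available as a DocumentTermMatrix with the same terms (columns) as the training data. Rows 21 to 25 of AssociatedPress merely stand in for "new" documents here, and the names `train_dtm`, `new_dtm`, `new_gamma` and `new_top` are made up for the example:

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

train_dtm <- AssociatedPress[1:20, ]    # documents used to fit the model
new_dtm   <- AssociatedPress[21:25, ]   # stand-in for unseen documents

k <- 5
lda <- LDA(train_dtm, k = k, control = list(alpha = 0.1))

# posterior() with newdata estimates topic proportions for the new
# documents using the already fitted model.
new_gamma <- posterior(lda, newdata = new_dtm)$topics

# Most likely topic for each new document.
new_top <- apply(new_gamma, 1, which.max)
new_top
```

The fit is stochastic, so the topic numbers will differ from run to run.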

10

To see which documents belong to which topic with the highest probability in the topic model, simply use:

topics(lda) 
1  2  3  4  5  6  7  8  9 10 11 12 
60 41 64 19 94 93 12 64 12 33 59 28 
13 14 15 16 17 18 19 20 21 22 23 24 
87 19 98 69 61 18 27 18 87 96 44 65 
25 26 27 28 29 30 31 32 33 34 35 36 
98 77 19 56 76 51 47 38 55 38 92 96 
37 38 39 40 41 42 43 44 45 46 47 48 
19 19 19 38 79 21 17 21 59 24 49  2 
49 50 51 52 53 54 55 56 57 58 59 60 
66 65 41 36 68 19 70 50 54 37 27 77 

To see the most likely term for each generated topic, simply use:

terms(lda) 
Topic 1  Topic 2  Topic 3  Topic 4  Topic 5 
"quite"  "food"  "lots"  "come"  "like" 
Topic 6  Topic 7  Topic 8  Topic 9  Topic 10 
    "ever"  "around"  "bar"  "loved"  "new" 
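
For reference, both accessors take an optional count, and `topics()` should match the arg-max of each row of the gamma matrix (barring exact ties). A small sketch on the package's built-in AssociatedPress data, not on the model shown above; the object names are just for illustration:

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

k <- 5
lda <- LDA(AssociatedPress[1:20, ], k = k, control = list(alpha = 0.1))

top_topic <- topics(lda)      # single most likely topic per document
top_terms <- terms(lda, 10)   # character matrix: 10 terms per topic

# The same assignment computed directly from the document-topic matrix.
from_gamma <- apply(lda@gamma, 1, which.max)

head(top_topic)
top_terms[1:3, ]
```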

I hope this answers your question!

Further reading that may help: http://www.rtexttools.com/1/post/2011/08/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels.html

Rachel Shuyan Wang


Can you explain how to interpret the result of topics(lda)? – 2015-03-02 04:37:32


Hi, I typed terms(lda) but still get numbers instead of words – Lucia 2016-01-16 04:52:56


I believe that to get named terms, you have to attach names to the input data. It works for me when using a 'simple_triplet_matrix' with named columns as input. – shabbychef 2016-02-23 20:39:10

1
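library(topicmodels)

k <- 5  # number of topics; not set in the original snippet, 5 is assumed from the object name
# fit an LDA model by Gibbs sampling; dtm is the DocumentTermMatrix
ldaGibbs5 <- LDA(dtm, k, method = "Gibbs")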

#get topics 
ldaGibbs5.topics <- as.matrix(topics(ldaGibbs5)) 
write.csv(ldaGibbs5.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv")) 

#get top 10 terms in each topic 
ldaGibbs5.terms <- as.matrix(terms(ldaGibbs5,10)) 
write.csv(ldaGibbs5.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv")) 

#get probability of each topic in each doc 
topicProbabilities <- as.data.frame(ldaGibbs5@gamma) 
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv")) 

Hi, please add some explanation to your code, as it helps others understand it. Code-only answers are not accepted. – 2016-10-21 19:12:27