gensim LDA module: always getting a uniform topic distribution at prediction time

I have a set of documents, and I want to know the topic distribution of each document (for different values of the number of topics). I took a toy program from this question. I first trained an LDA model with gensim, and then fed the training data back in as test data, to get the topic distribution of each doc in the training set. But I always get a uniform topic distribution.
Below is the code I used:
import gensim
import logging

logging.basicConfig(filename="logfile", format='%(message)s', level=logging.INFO)

def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return topic_dist

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gensim.corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word

mm = [dictionary.doc2bow(text) for text in texts]
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2,
                                      update_every=1, chunksize=10000, passes=1,
                                      minimum_probability=0.0)

newdocs = ["human system"]
print(lda[dictionary.doc2bow(newdocs)])

newdocs = ["Human machine interface for lab abc computer applications"]  # same as 1st doc in training
print(lda[dictionary.doc2bow(newdocs)])
Here is the output of the toy code:
[(0, 0.5), (1, 0.5)]
[(0, 0.5), (1, 0.5)]
I checked with several more examples, but all of them end up giving the same equal-probability result.
Here is the log file that was produced (i.e., the logger's output):
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions)
using symmetric alpha at 0.5
using symmetric eta at 0.5
using serial LDA version on this node
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words
PROGRESS: pass 0, at document #9/9
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps"
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer"
topic diff=0.539396, rho=1.000000
It says "too few updates, training might not converge", so I kept increasing the number of passes, up to 1000, but the output is still the same. (I also tried increasing the number of topics, although that is unrelated to convergence.)
Perfect! Thanks! And there is one more thing I need to understand. The main goal of all this, as mentioned in the question, was to get the topic distribution of each document. After running LDA, is there a better way to get it than the small hack I used in my code (feeding the training set back in as the test set)? – MysticForce