
gensim LDA module: always getting a uniform topic distribution at prediction time

I have a set of documents, and I want to know the topic distribution of each document (for different values of the number of topics). I took a toy program from this question. I first trained gensim's LDA, then fed the training data back in as test data, to get the topic distribution of each doc in the training set. But I always get a uniform topic distribution.

Here is the code I used:

import gensim 
import logging 
logging.basicConfig(filename="logfile",format='%(message)s', level=logging.INFO) 


def get_doc_topics(lda, bow): 
    gamma, _ = lda.inference([bow]) 
    topic_dist = gamma[0]/sum(gamma[0]) # normalize distribution 
    return topic_dist 

documents = ['Human machine interface for lab abc computer applications', 
      'A survey of user opinion of computer system response time', 
      'The EPS user interface management system', 
      'System and human system engineering testing of EPS', 
      'Relation of user perceived response time to error measurement', 
      'The generation of random binary unordered trees', 
      'The intersection graph of paths in trees', 
      'Graph minors IV Widths of trees and well quasi ordering', 
      'Graph minors A survey'] 

texts = [[word for word in document.lower().split()] for document in documents] 
dictionary = gensim.corpora.Dictionary(texts) 
id2word = {} 
for word in dictionary.token2id:  
    id2word[dictionary.token2id[word]] = word 
mm = [dictionary.doc2bow(text) for text in texts] 
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2, update_every=1, chunksize=10000, passes=1,minimum_probability=0.0) 

newdocs=["human system"] 
print lda[dictionary.doc2bow(newdocs)] 

newdocs=["Human machine interface for lab abc computer applications"] #same as 1st doc in training 
print lda[dictionary.doc2bow(newdocs)] 

Here is the output of the toy code:

[(0, 0.5), (1, 0.5)] 
[(0, 0.5), (1, 0.5)] 

I checked with a few more examples, but all of them end up giving the same equal-probability result.

Here is the log file that was produced (i.e., the logger output):

adding document #0 to Dictionary(0 unique tokens: []) 
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions) 
using symmetric alpha at 0.5 
using symmetric eta at 0.5 
using serial LDA version on this node 
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000 
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy 
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words 
PROGRESS: pass 0, at document #9/9 
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps" 
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer" 
topic diff=0.539396, rho=1.000000 

It says "too few updates, training might not converge", so I kept increasing the number of passes, up to 1000, but the output is still the same. (I also tried increasing the number of topics, although that is unrelated to convergence.)

Answer


The problem lies in converting the variable newdocs into a gensim document. dictionary.doc2bow() does expect a list, but a list of words. You provided a list of documents, so "human system" is interpreted as a single word; since there is no such word in the training set, it is ignored. To make my point clearer, look at the output of the following code:

import gensim 
documents = ['Human machine interface for lab abc computer applications', 
      'A survey of user opinion of computer system response time', 
      'The EPS user interface management system', 
      'System and human system engineering testing of EPS', 
      'Relation of user perceived response time to error measurement', 
      'The generation of random binary unordered trees', 
      'The intersection graph of paths in trees', 
      'Graph minors IV Widths of trees and well quasi ordering', 
      'Graph minors A survey'] 

texts = [[word for word in document.lower().split()] for document in documents] 
dictionary = gensim.corpora.Dictionary(texts) 

print dictionary.doc2bow("human system".split()) 
print dictionary.doc2bow(["human system"]) 
print dictionary.doc2bow(["human"]) 
print dictionary.doc2bow(["foo"]) 

So to correct the code above, all you have to do is change newdocs as follows:

newdocs = "human system".lower().split() 
newdocs = "Human machine interface for lab abc computer applications".lower().split() 

Oh, and by the way, the behaviour you observed, getting the same probabilities everywhere, is simply the topic distribution of an empty document, i.e. a uniform distribution.
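You can see this directly: an out-of-vocabulary query yields an empty bag-of-words, and the model then falls back to its symmetric prior (alpha = 0.5 per topic, per the log above), which is exactly the output shown in the question:

print dictionary.doc2bow(["human system"])   # [] -- empty bag-of-words 
print lda[dictionary.doc2bow(["human system"])] # [(0, 0.5), (1, 0.5)] 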


Perfect! Thanks! And there is one more thing I need to understand. The main goal of all this, as mentioned in the question, was to get the topic distribution of each document. Is there a better way to get it after running LDA than the little hack I used in my code (feeding the training set back in as the test set)? – MysticForce
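An aside on that follow-up: applying the trained model to the training corpus is the standard approach in gensim rather than a hack. A minimal sketch, assuming your gensim version also provides LdaModel.get_document_topics (present in recent versions), which does the same thing explicitly:

# Per-document topic distributions for every document in the training corpus. 
for i, bow in enumerate(mm): 
    print i, lda.get_document_topics(bow, minimum_probability=0.0) 

# Equivalently, indexing the model with the whole corpus yields them lazily: 
for i, doc_topics in enumerate(lda[mm]): 
    print i, doc_topics 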