2013-04-28 164 views

Applying LDA to a corpus for training using gensim

I have a corpus of about 20,000 documents, and I need to train an LDA topic model on that data set.

import logging, gensim 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 
id2word = gensim.corpora.Dictionary('questions.dict') 
mm = gensim.corpora.MmCorpus('questions.mm') 
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20) 
lda.print_topics(20) 

Whenever I run this program, I get the following error:

2013-04-28 09:57:09,750 : INFO : adding document #0 to Dictionary(0 unique tokens) 
2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions) 
2013-04-28 09:57:09,785 : INFO : loaded corpus index from questions.mm.index 
2013-04-28 09:57:09,790 : INFO : initializing corpus reader from questions.mm 
2013-04-28 09:57:09,796 : INFO : accepted corpus with 19188 documents, 15791 features, 106222 non-zero entries 
2013-04-28 09:57:09,802 : INFO : using serial LDA version on this node 
2013-04-28 09:57:09,808 : INFO : running batch LDA training, 100 topics, 20 passes over the supplied corpus of 19188 documents, updating model once every 19188 documents 
2013-04-28 09:57:10,267 : INFO : PROGRESS: iteration 0, at document #3000/19188 

Traceback (most recent call last):
  File "C:/Users/Animesh/Desktop/NLP/topicmodel/lda.py", line 10, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 265, in __init__
    self.update(corpus)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 445, in update
    self.do_estep(chunk, other)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 365, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 318, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index (11) out of range (0<=index<10) in dimension 1

I even tried changing the parameter values in the LdaModel call, but I always get the same error!

What should I do?


11 unique tokens looks a bit suspicious. – alvas 2013-09-18 09:41:33

Answers


It looks like your dictionary (id2word) does not match your corpus object (mm).

For whatever reason, id2word (the mapping between word tokens and their integer ids) contains only 11 tokens:

2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions)

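Your corpus contains 15791 features, so as soon as the model looks up a feature with id > 10, it fails. The ids in expElogbetad = self.expElogbeta[:, ids] is the list of all word ids in a particular document.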
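One plausible cause, which the question does not confirm: gensim.corpora.Dictionary(...) takes an iterable of tokenized documents, not a file name, so the string 'questions.dict' may have been consumed as 14 one-character documents. The pure-Python sketch below only mimics that counting (it does not call gensim, and the variable names are illustrative), and it arrives at the same numbers as the log line above.

```python
# Mimic how a constructor that expects an iterable of documents would
# see a plain file-name string.
# Assumption: every item yielded by the argument is one document, and
# every item yielded by a document is one token.
filename = "questions.dict"

documents = [list(doc) for doc in filename]  # one "document" per character
unique_tokens = {token for doc in documents for token in doc}
corpus_positions = sum(len(doc) for doc in documents)

# Compare with: built Dictionary(11 unique tokens) from 14 documents
# (total 14 corpus positions)
print(len(documents), len(unique_tokens), corpus_positions)
```

If this is what happened, the dictionary on disk was never read at all, which would explain why changing the LdaModel parameters made no difference.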

I would rerun the creation of the corpus and the dictionary:

$ python -m gensim.scripts.make_wiki (from the gensim LDA tutorial).

The logging output from creating the dictionary should, I believe, report far more than 11 tokens. I ran into a similar problem myself.
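If questions.dict was written earlier with Dictionary.save(), it may be enough to load it rather than pass the file name to the Dictionary constructor. The following is a minimal sketch under that assumption. The helper names vocab_mismatch and load_and_train are made up for illustration, and the size check is only a rough sanity test, not something gensim requires.

```python
def vocab_mismatch(dictionary_size, corpus_num_terms):
    """Return True when the corpus refers to more word ids than the dictionary holds."""
    return corpus_num_terms > dictionary_size


def load_and_train(dict_path="questions.dict", corpus_path="questions.mm"):
    # gensim is imported here so the helper above can be used without it.
    from gensim import corpora, models

    # Dictionary.load() reads a file written with Dictionary.save().
    # Assumption: if the file was written with save_as_text() instead,
    # Dictionary.load_from_text() would be the call to use.
    id2word = corpora.Dictionary.load(dict_path)
    mm = corpora.MmCorpus(corpus_path)

    # Fail early with a readable message instead of an IndexError deep
    # inside LDA inference.
    if vocab_mismatch(len(id2word), mm.num_terms):
        raise ValueError(
            "dictionary has %d tokens but corpus has %d features"
            % (len(id2word), mm.num_terms)
        )

    # Same training parameters as in the question.
    return models.LdaModel(corpus=mm, id2word=id2word, num_topics=100,
                           update_every=0, chunksize=3000, passes=20)
```

With the numbers from the log in the question, vocab_mismatch(11, 15791) is True, so the check would stop before training starts.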
