2016-11-14 26 views

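How do I fix a gensim KeyError when I try to get a document's vector?

I am using the code below to train a doc2vec model. Each document is defined as the text/line between two lines: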

  • clueweb09-en0001-XX-XXXXX
  • end_clueweb09-en0001-XX-XXXXX

This is my code:

import os

import gensim
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

path = '/home/work/Step2/test-input/html'

alldocs = []  # will hold all docs in original order

for fname in os.listdir(path):
    with open(path + '/' + fname) as alldata:
        for line in alldata:
            docId = line  # first line of a record: the document tag
            print docId
            context = alldata.next()  # second line: the document text
            # print context
            tokens = gensim.utils.to_unicode(context).split()
            end = alldata.next()  # third line: the end_... marker
            alldocs.append(LabeledSentence(tokens[:], [docId]))

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(alldocs)
for epoch in range(10):
    model.train(alldocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model to mmap-able files
model.save(path + '/my_html_model.doc2vec')

But I get an error when I write model.docvecs['clueweb09-en0001-01-34238'], whereas model.docvecs[0] returns a result.

This is the error I get:

Traceback (most recent call last):
  File "getLearingDoc.py", line 40, in <module>
    print model.docvecs['clueweb09-en0001-01-34238']
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return self.doctag_syn0[self._int_index(index)]
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 315, in _int_index
    return self.max_rawint + 1 + self.doctags[index].offset
KeyError: 'clueweb09-en0001-01-34238'

I have no experience with Python or gensim. Please tell me how I can solve this problem.

Answer


Are you sure that the exact tag 'clueweb09-en0001-01-34238' (with no stray newlines, etc.) was presented during training?
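One way a stray newline can end up in a tag: iterating over a file yields each line with its trailing '\n', and the loop in the question stores `line` unchanged as the tag. The following is a minimal sketch of reading the same three-line layout while stripping the tag. It uses only the standard library, and `read_records` and the sample text are hypothetical names and data made up for illustration.

```python
import io


def read_records(lines):
    """Yield (doc_id, tokens) pairs from lines laid out as:
    tag line, text line, end-tag line (the layout described in the question)."""
    lines = iter(lines)
    for line in lines:
        doc_id = line.strip()      # drop the trailing newline and any spaces
        if not doc_id:
            continue               # skip blank lines between records
        text = next(lines, u"")    # second line of the record: the text
        next(lines, u"")           # third line: the end_... marker, discarded
        yield doc_id, text.split()


# Hypothetical sample data standing in for one input file.
sample = io.StringIO(
    u"clueweb09-en0001-01-34238\n"
    u"some example text here\n"
    u"end_clueweb09-en0001-01-34238\n"
)

raw_tag = u"clueweb09-en0001-01-34238\n"   # what `docId = line` would store
records = dict(read_records(sample))

print(raw_tag in records)                        # False: the newline makes it a different key
print(u"clueweb09-en0001-01-34238" in records)   # True: the stripped tag matches
```

In the question's loop, the equivalent change would likely be `docId = line.strip()`.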

You can see all the string doctags known to the model in the keys of the model.docvecs.doctags dict, or in the list model.docvecs.offset2doctag.
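Once the known tags are listed, a small helper can show whether the tag you want is present in a slightly different form. This is only a sketch: `find_near_matches` is a hypothetical helper, and the `known_tags` list below is made-up data standing in for the list of tags that the answer says is available as model.docvecs.offset2doctag.

```python
def find_near_matches(known_tags, wanted):
    """Return the tags in known_tags that equal `wanted` once
    leading and trailing whitespace is ignored on both sides."""
    target = wanted.strip()
    return [tag for tag in known_tags if tag.strip() == target]


# Stand-in data (an assumption): with a trained model, one would pass
# the model's own list of tags here instead.
known_tags = [
    "clueweb09-en0001-01-34238\n",
    "clueweb09-en0001-01-34239\n",
]

matches = find_near_matches(known_tags, "clueweb09-en0001-01-34238")

# repr() makes invisible characters such as '\n' visible.
print([repr(tag) for tag in matches])
```

If a match comes back with a trailing '\n', that exact string (newline included) is the key the model knows, which would explain the KeyError for the clean tag.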
