2016-08-16 63 views

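I am using Doc2Vec (from gensim.models.doc2vec import Doc2Vec) to train on two identical sentences (documents), and when I inspect the vector for each sentence, they are completely different. Does the neural network use a different random initialization for each one? Why does gensim Doc2Vec give different vectors for the same sentence?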

# imports 
from gensim.models.doc2vec import LabeledSentence 
from gensim.models.doc2vec import Doc2Vec 
from gensim import utils 

# Document iteration class (turns each document into a single
# labeled sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that the prefixes (the dict values) are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        # yield one LabeledSentence per source file, tagged with its prefix
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        # same as __iter__, but collects the sentences into a list
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are the names of two identical documents (diff gives nothing)
inp = LabeledDocs({"play": "play", "play3": "play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)

# After training, model.docvecs["play"] is very different from
# model.docvecs["play3"]

Why is this? Both play and play3 contain:

foot ball is a sport 
played with a ball where 
teams of 11 each try to 
score on different goals 
and play with the ball 

Answer

Yes, each sentence vector is initialized differently.

This happens in the reset_weights method. The code that randomly initializes the sentence vectors is this:

for i in xrange(length): 
    # construct deterministic seed from index AND model seed 
    seed = "%d %s" % (model.seed, self.index_to_doctag(i)) 
    self.doctag_syn0[i] = model.seeded_vector(seed) 

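Here you can see that each sentence vector is initialized from the model's random seed combined with the sentence's tag. So it makes sense that in your example play and play3 start out with different vectors, even though their text is identical.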
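To make the effect of tag-based seeding concrete, here is a minimal standalone sketch. It does not use gensim: seeded_vector below is a hypothetical stand-in (it hashes the seed string with zlib.crc32 and draws small values from Python's random module), and the default seed of 1 and size of 20 are assumptions chosen for illustration. The point is only that the same tag always gives the same starting vector, while different tags give different ones.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    # Hypothetical stand-in for the library routine: derive a
    # deterministic RNG from the seed string, then draw small values
    # in the range [-0.5 / vector_size, 0.5 / vector_size).
    rng = random.Random(zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF)
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


def initial_doc_vector(model_seed, tag, vector_size=20):
    # Same seed construction as in the snippet above: model seed AND tag.
    seed = "%d %s" % (model_seed, tag)
    return seeded_vector(seed, vector_size)


play = initial_doc_vector(1, "play")
play3 = initial_doc_vector(1, "play3")
play_again = initial_doc_vector(1, "play")

print(play == play_again)  # same tag and model seed: identical start
print(play == play3)       # different tag: different start
```

Because the document text never enters the seed, two documents with identical content but different tags begin training from unrelated points.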

However, if you train the model properly, I would expect the two vectors to end up very close to each other.
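One way to check how close the two vectors are is cosine similarity, sketched below in plain Python. The numbers in vec_play and vec_play3 are made-up placeholders, not real model output; in practice you would pass in model.docvecs["play"] and model.docvecs["play3"] from the question's code.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths;
    # 1.0 means the vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder values for illustration only (not trained vectors).
vec_play = [0.12, -0.40, 0.33]
vec_play3 = [0.10, -0.38, 0.35]

similarity = cosine_similarity(vec_play, vec_play3)
print(similarity)
```

A value near 1.0 would indicate that training has pulled the two document vectors together despite their different starting points.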
