Why does gensim Doc2Vec give different vectors for the same sentence?

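I am using Doc2Vec (from gensim.models.doc2vec import Doc2Vec) to train on two sentences (documents) that are exactly the same, and when I check the vector for each sentence, they are completely different. Does the neural network use a different random initialization for each one?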

# imports 
from gensim.models.doc2vec import LabeledSentence 
from gensim.models.doc2vec import Doc2Vec 
from gensim import utils 

# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object): 
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are names of identical documents (diff gives nothing) 
inp = LabeledDocs({"play":"play", "play3":"play3"}) 
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025, 
       min_alpha=0.025, batch_words=1) 
model.build_vocab(inp.to_array()) 
for epoch in range(10): 
    model.train(inp) 

# after training, model.docvecs["play"] is very different from
# model.docvecs["play3"]

Why is this? Both play and play3 contain:

foot ball is a sport 
played with a ball where 
teams of 11 each try to 
score on different goals 
and play with the ball


2016-08-16 Francisco Vargas

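Yes, each sentence vector is initialized differently.

Specifically, this happens in the reset_weights method. The code that randomly initializes the sentence vectors looks like this: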

for i in xrange(length): 
    # construct deterministic seed from index AND model seed 
    seed = "%d %s" % (model.seed, self.index_to_doctag(i)) 
    self.doctag_syn0[i] = model.seeded_vector(seed)

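Here you can see that each sentence vector is initialized using the model's random seed and the sentence's tag. So it makes sense that play and play3 in your example lead to different vectors.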
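To make the effect of that seed string concrete, here is a minimal standalone sketch. It is not gensim's code: seeded_vector_sketch is a hypothetical name, and zlib.crc32 plus the standard random module are assumptions standing in for the library's own hash function and random generator. It only shows that a seed built from the model seed and the tag is deterministic per tag and differs between tags.

```python
import random
import zlib


def seeded_vector_sketch(model_seed, doctag, size):
    """Return a small pseudo-random vector determined by (model_seed, doctag)."""
    # Same seed-string layout as the snippet above: "<model seed> <doc tag>".
    seed_string = "%d %s" % (model_seed, doctag)
    # Assumption: crc32 stands in for the library's hash function so the
    # result is stable across Python processes.
    rng = random.Random(zlib.crc32(seed_string.encode("utf-8")))
    # Small values centred on zero; the scaling is illustrative only.
    return [(rng.random() - 0.5) / size for _ in range(size)]


# Identical text does not matter here: only the tag and the model seed do.
play = seeded_vector_sketch(1, "play", 20)
play3 = seeded_vector_sketch(1, "play3", 20)
```

Calling the function twice with the same tag gives the same starting vector, while "play" and "play3" start from different points even though the documents are identical.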

However, if you train the model properly, I would expect the two vectors to end up very close to each other.
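One way to check how close two vectors are is cosine similarity. The helper below is a plain-Python sketch; cosine_similarity is a hypothetical name and not part of gensim, and it assumes both inputs are non-zero sequences of equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Passing it the two vectors from the question, model.docvecs["play"] and model.docvecs["play3"], gives a value near 1 when they point in almost the same direction and a value near 0 when they are unrelated.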


2016-09-07 12:38:05
