Updating a gensim word2vec model

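I have a word2vec model in gensim that was trained on 98892 documents. For any given sentence that is not in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that the next query returns some results. This is how I am doing it: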

new_sentence = ['moscow', 'weather', 'cold'] 
model.train(new_sentence)

and it prints this to the log:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features 
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs 
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I run a most_similar query with the same new_sentence (i.e. model.most_similar(positive=new_sentence)), it raises an error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"

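Does this mean that the word 'cold' is not part of the vocabulary the model was trained on (am I right)?

So the question is: how do I update the model so that it gives all possible similarities for a given new sentence?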


2014-03-01 user2480542

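Someone has extended gensim's Word2Vec into an "online Word2Vec", which uses online learning to update the vocabulary and learn new words. I have not tried it myself, but you can check it out at: http://rutumulkar.com/blog/2015/word2vec/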

If your model was generated by the C tool and loaded with load_word2vec_format, it cannot be updated. See the online training section of the Word2Vec tutorial:

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.


2014-03-27 20:15:40 fjxx

Thanks everyone for the replies. I am trying to follow the online training approach... – Nacho

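train() expects a sequence of sentences as input, not a single sentence.
train() only updates the weights of the existing feature vectors, based on the existing vocabulary. You cannot add new vocabulary (= new feature vectors) with train().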
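The two points above can be sketched in a few lines. This is a minimal illustration, not gensim code: `as_sentences` and `split_by_vocabulary` are hypothetical helper names, and the plain `set` below is only a stand-in for the model's vocabulary. With a real model, recent gensim releases let you test membership with `word in model.wv` (an assumption about your gensim version; the 2014 release in the question exposed the vocabulary differently).

```python
def as_sentences(tokens_or_sentences):
    """Return a list of tokenized sentences, wrapping a single sentence if needed."""
    items = list(tokens_or_sentences)
    if items and isinstance(items[0], str):
        # A flat list of strings is ONE sentence; wrap it so it becomes
        # a corpus containing a single sentence.
        return [items]
    return [list(sentence) for sentence in items]


def split_by_vocabulary(tokens, vocabulary):
    """Split tokens into (known, unknown) using `in` against the vocabulary."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


new_sentence = ['moscow', 'weather', 'cold']
stand_in_vocab = {'moscow', 'weather'}  # hypothetical stand-in for the model vocabulary

print(as_sentences(new_sentence))  # [['moscow', 'weather', 'cold']]
print(split_by_vocabulary(new_sentence, stand_in_vocab))  # (['moscow', 'weather'], ['cold'])
```

Passing only the known words to most_similar(positive=...) avoids the KeyError, but it does not teach the model anything about the unknown ones.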


2014-05-31 10:23:59 Radim

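So how do I add new vocabulary? Is it really impossible? Thanks – Nacho

@Nacho, ["The word2vec algorithm doesn't support adding new words dynamically."](http://rare-technologies.com/word2vec-tutorial/#comment-2281) So no, it is not possible unless you retrain the whole model with the new vocabulary. – Jason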

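First of all, you cannot add new words to a pre-trained model.

However, the "new" doc2vec model, published in 2014, meets all your requirements. You can use it to train document vectors directly, instead of obtaining a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. The model itself still does not change, but in my experiments the inferred results are quite good.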
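A rough sketch of that inference step, assuming an already trained gensim Doc2Vec model is passed in. `infer_and_rank` is a hypothetical helper name; `infer_vector` is the Doc2Vec inference method, while the `dv` attribute is the name used by gensim 4.x (older releases called it `docvecs`), so adjust it to your version.

```python
def infer_and_rank(doc2vec_model, tokens, topn=3):
    """Infer a vector for an unseen tokenized sentence and rank the training documents.

    Assumes `doc2vec_model` is a trained gensim Doc2Vec model (gensim 4.x naming).
    """
    # Inference leaves the model untouched; it only fits a vector for `tokens`.
    # Words the model has never seen are simply ignored during inference.
    vector = doc2vec_model.infer_vector(list(tokens))
    # Compare the inferred vector against the document vectors learned in training.
    return doc2vec_model.dv.most_similar([vector], topn=topn)
```

A call would look like infer_and_rank(model, ['moscow', 'weather', 'cold']). Inference is randomized, so repeated calls on the same sentence can return slightly different rankings.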


2016-08-19 23:52:10 fyraimar

The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows that. Try the doc2vec model.


2016-10-13 00:10:35

As of gensim 0.13.3, it is possible to train Word2Vec online with gensim.

model.build_vocab(new_sentences, update=True) 
model.train(new_sentences)
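On more recent gensim releases, train() also needs explicit total_examples and epochs arguments, so the two calls above may have to be adapted. The sketch below wraps them in a hypothetical helper, `update_word2vec`; it assumes the model was trained by gensim itself (not loaded from the C format) and that it exposes an `epochs` attribute, as current releases do.

```python
def update_word2vec(model, new_sentences, epochs=None):
    """Add unseen words to an existing gensim Word2Vec model, then keep training.

    `new_sentences` must be a sequence of tokenized sentences (a list of lists).
    """
    new_sentences = [list(sentence) for sentence in new_sentences]
    # Extend the existing vocabulary instead of rebuilding it from scratch.
    # Words that occur fewer times than the model's min_count are still dropped.
    model.build_vocab(new_sentences, update=True)
    # Recent gensim versions require total_examples and epochs to be explicit.
    model.train(
        new_sentences,
        total_examples=len(new_sentences),
        epochs=epochs if epochs is not None else model.epochs,
    )
    return model
```

A call would look like update_word2vec(model, [['moscow', 'weather', 'cold']]). Note that with the default min_count of 5, a word that appears only once in the new sentences will still be missing from the vocabulary afterwards.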


2016-12-02 16:07:33 ksindi

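For some reason, though, this does not actually work for me. http://stackoverflow.com/questions/42357678/gensim-word2vec-array-dimensions-in-updating-with-online-word-embedding – chase

I had no problems running this. I will try to take a look at your SO post this weekend. – ksindi

@chase I answered your post – ksindi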
