How do I update a Doc2Vec model with new sentences?

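I am training a Doc2Vec model on Wikipedia. There is not enough memory to train the model in one pass: when I try to build the vocabulary from all the sentences, my Python process crashes.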

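So I want to split the process into several chunks. I select some documents, train the model, save it, load the saved model, and try to update it with new sentences/labels.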
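For reference, the batching step on its own can be sketched as below. This is only a minimal illustration of cutting a document stream into fixed-size chunks without holding the whole corpus in memory; `iter_chunks` and the toy `docs` generator are hypothetical names, and the sketch makes no claim about whether a Doc2Vec model can absorb new vocabulary from later chunks.

```python
from itertools import islice


def iter_chunks(documents, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for a lazily read stream of tokenized documents.
docs = (["doc", str(i)] for i in range(7))

chunk_sizes = [len(chunk) for chunk in iter_chunks(docs, 3)]
print(chunk_sizes)  # [3, 3, 1]
```

Because the input is consumed through an iterator, only one chunk is materialized at a time.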

My first training run:

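import random
import gensim

model = gensim.models.Doc2Vec(min_count=5, window=10, size=300, sample=1e-3, negative=5, workers=3)

# Build the vocabulary once, from the first batch of labeled sentences
sentences_list = sentences.to_array()
model.build_vocab(sentences_list)
Idx = list(range(len(sentences_list)))  # a list, so random.shuffle can reorder it in place

for epoch in range(10):
    # Present the sentences in a new random order on every pass
    random.shuffle(Idx)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)

model.save('example')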

This code runs perfectly. After that, I do:

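import random
from gensim.models import Doc2Vec

# Reload the model saved after the first training run
model = Doc2Vec.load('example')

sentences_list_new = sentences_new.to_array()
Idx = list(range(len(sentences_list_new)))  # a list, so random.shuffle can reorder it in place

for epoch in range(10):
    random.shuffle(Idx)
    perm_sentences_new = [sentences_list_new[i] for i in Idx]
    model.train(perm_sentences_new)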

However, I get this warning:

WARNING:gensim.models.word2vec:supplied example count (9999) did not equal expected count (133662)

and the new words are not added to the model.

Then I tried to build the vocabulary with the new words:

model.build_vocab(sentences_list_new)

But that raises this error:

RuntimeError: must sort before initializing vectors/weights

But... after this, the new words are in the vocabulary.

Where is the problem?

Source

2015-11-25 Татьяна Паскевич

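Not sure what is wrong, but (1) off the top of my head, I remember that Doc2Vec treats tagged sentences as the document level, and that the two levels (sentence and vocabulary) are trained either together or one at a time. I don't see you handling the sentences. (2) The Doc2Vec object has a function 'model.sort_vocab()'. Not sure whether that solves it. – Mai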

That doesn't work either. See the comments below. –

From Gordon Mohr's answer here:

Currently, the model discovers the vocab only once so using build_vocab() is not supported again.

According to sebastien-j in this discussion:

the memory usage should be approximately 8 * size * |V| bytes (plus some overhead).

For |V|=10^7 and size=500, this is 40 GB.
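The arithmetic in that rule of thumb can be checked with a few lines of Python. This is just a sketch of the quoted formula (8 * size * |V| bytes, overhead ignored); `estimate_model_bytes` is a hypothetical helper, and the 10^7 vocabulary size used for the size=300 case is an assumption for illustration, not a figure from the question.

```python
QUOTED_FACTOR = 8  # the constant from the quoted estimate: 8 * size * |V| bytes


def estimate_model_bytes(vocab_size, vector_size):
    """Approximate model memory in bytes using the quoted rule of thumb."""
    return QUOTED_FACTOR * vector_size * vocab_size


# The case from the quote: |V| = 10^7, size = 500
quoted_case = estimate_model_bytes(vocab_size=10**7, vector_size=500)
print(quoted_case / 1e9, "GB")  # 40.0 GB

# size=300 as in the question, with an ASSUMED vocabulary of 10^7 words
assumed_case = estimate_model_bytes(vocab_size=10**7, vector_size=300)
print(assumed_case / 1e9, "GB")  # 24.0 GB
```

The estimate scales linearly in both the vector size and the vocabulary size, so shrinking either one reduces memory proportionally.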

See if your system has enough memory (if it does there could be a python version issue which is unlikely in your case...)

If it doesn't you could try increasing the min_count
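To get a feel for how much raising min_count would shrink the vocabulary before committing to a full build, the word counts can be tallied separately. The sketch below is plain Python rather than gensim code; `vocab_size_for` and `toy_docs` are hypothetical, and it assumes a word is kept when its total count is at least min_count.

```python
from collections import Counter


def vocab_size_for(tokenized_docs, min_count):
    """Count distinct words whose total frequency is at least min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return sum(1 for count in counts.values() if count >= min_count)


# Toy corpus, only for illustration
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

sizes = {m: vocab_size_for(toy_docs, m) for m in (1, 2, 3)}
print(sizes)  # {1: 5, 2: 3, 3: 1}
```

On a real corpus most distinct words are rare, so even a small increase in min_count can cut |V| substantially, and with it the memory estimate above.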

Source

2016-06-01 18:03:40
