I took the predict_output_word method from the official gensim GitHub repository. It takes a word2vec model trained with skip-gram (and negative sampling) and tries to predict the center word: it sums the vectors of all the in-vocabulary input words (looked up by index with np_sum) and, if cbow_mean is set, divides that sum by the number of input words. It then propagates this vector through the output layer and applies a softmax, normalizing by the sum of all the resulting values to get each candidate's probability, and returns the most probable words. Is there a better way to handle this to get better words? This approach gives very bad results for shorter sentences. The code below is from GitHub.
import warnings

from numpy import exp, dot, sum as np_sum
from gensim import matutils


def predict_output_word(model, context_words_list, topn=10):
    """Report the probability distribution of the center word given the context words
    as input to the trained model."""
    if not model.negative:
        raise RuntimeError("We have currently only implemented predict_output_word "
                           "for the negative sampling scheme, so you need to have "
                           "run word2vec with negative > 0 for this to work.")

    if not hasattr(model.wv, 'syn0') or not hasattr(model, 'syn1neg'):
        raise RuntimeError("Parameters required for predicting the output words not found.")

    word_vocabs = [model.wv.vocab[w] for w in context_words_list if w in model.wv.vocab]
    if not word_vocabs:
        warnings.warn("All the input context words are out-of-vocabulary for the current model.")
        return None

    word2_indices = [word.index for word in word_vocabs]

    # sum the vectors of all in-vocabulary context words
    l1 = np_sum(model.wv.syn0[word2_indices], axis=0)
    if word2_indices and model.cbow_mean:
        # average instead of sum when the model was trained with cbow_mean
        l1 /= len(word2_indices)

    # propagate hidden -> output and take softmax to get probabilities
    prob_values = exp(dot(l1, model.syn1neg.T))
    prob_values /= sum(prob_values)
    top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)

    # return the most probable output words with their probabilities
    return [(model.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]
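For reference, here is a minimal usage sketch with a hypothetical toy corpus, assuming gensim < 4.0 (where model.wv.syn0, model.syn1neg, and model.wv.index2word still exist) and a model trained with skip-gram and negative sampling, as the function requires:

from gensim.models import Word2Vec

# hypothetical toy corpus; a real corpus would need to be much larger
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "sleeps", "all", "day"],
]

# negative > 0 is required because predict_output_word only supports
# the negative sampling scheme; sg=1 selects skip-gram
model = Word2Vec(sentences, size=50, window=2, min_count=1,
                 sg=1, negative=5, iter=100)

# predict the center word given surrounding context words
print(predict_output_word(model, ["the", "brown"], topn=3))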
Welcome to Stack Overflow. Please read and follow the posting guidelines in the help documentation. A [minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. We should be able to paste the posted code into a text file and reproduce the problem you describe. In particular, provide a small data set that gives you trouble. Without one, it is unclear whether the problem is the algorithm, the amount of training, or a lack of reliable data. – Prune