How to shuffle words in word2vec

I have this piece of code:

import gensim 
import random 


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt') 

read_data = file.read() 

data = read_data.split('\n') 

sentences = [line.split() for line in data] 
print(len(sentences)) 
print(sentences[1]) 

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5) 
model.build_vocab(sentences) 

for epoch in range(5): 
    shuffled_sentences = random.shuffle(sentences) 
    model.train(shuffled_sentences) 
    print(epoch) 
    print(model) 

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')

If I print a single sentence, the output looks like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

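What I need is to shuffle the words before training and then save the model.

I'm not sure whether I've coded this the right way. I end up with an exception: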

Exception in thread Thread-8: 
Traceback (most recent call last): 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner 
    self.run() 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run 
    self._target(*self._args, **self._kwargs) 
    File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer 
    for sent_idx, sentence in enumerate(sentences): 
    File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__ 
    for document in self.corpus: 
TypeError: 'NoneType' object is not iterable

So my question is: how can I shuffle the words?

2016-05-08 ssh26

Did this solve your problem? – PKuhn

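random.shuffle shuffles the list in place and returns None. For this reason, after this call your shuffled_sentences is None, and that None is what model.train then tries to iterate over, hence the TypeError.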
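
A minimal sketch of that behaviour, using only the standard library. The sample sentences are made up for illustration. It shows that random.shuffle reorders the list it is given and returns None, and that random.sample can be used instead when a shuffled copy is wanted and the original order should stay intact.

```python
import random

# Made-up sample data, standing in for the tokenised sentences in the question.
sentences = [["JO_1", "TI_2"], ["JO_3", "TA_4"], ["TA_5", "TI_6"]]
original = list(sentences)  # shallow copy, to compare against later

# random.shuffle works in place and returns None, so its result
# should never be assigned and passed on.
result = random.shuffle(sentences)
print(result)          # None
print(len(sentences))  # 3, the list itself was reordered

# random.sample returns a new shuffled list and leaves its input untouched.
shuffled_copy = random.sample(original, len(original))
```

So inside the training loop, either call random.shuffle(sentences) on its own line and pass sentences to the model, or build a shuffled copy and pass that.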

2016-05-08 17:09:23 PKuhn

Thanks for your reply; please see my other post. – ssh26

model.build_vocab(sentences) 
sentences_list = sentences 
Idx = range(len(sentences_list)) 
print(Idx) 
for epoch in range(5): 
    random.shuffle(sentences) 
    perm_sentences = [sentences_list[i] for i in Idx] 
    model.train(perm_sentences) 
    print(epoch) 
    print(model) 
    model.save("somefile.model")

This solved my problem.

But how do I shuffle the words within a sentence?

Sentence: ['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

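My goal: when I check the most similar words for, say, 'JO_3787672', the predictions are always words starting with 'JO_'. Words starting with 'TA_' and 'TI_' get very small similarity scores. I suspect this is because of the word positions in the data (I'm not sure). That is why I'm trying to shuffle the words within each sentence (I'm really not sure whether it will help).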
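
One way to shuffle the words inside each sentence, sketched with the standard library only. The helper name shuffle_words is hypothetical, and the two sentences are shortened from the example above. Each sentence is copied before shuffling, so the original token order is preserved.

```python
import random

def shuffle_words(sentences, rng=random):
    """Return new sentences whose tokens are in random order (hypothetical helper)."""
    shuffled = []
    for sentence in sentences:
        tokens = list(sentence)  # copy, so the input sentence is left untouched
        rng.shuffle(tokens)      # in-place shuffle of the copy; returns None
        shuffled.append(tokens)
    return shuffled

# Shortened sample sentences in the same JO_/TI_/TA_ format as the question.
sentences = [
    ['JO_3787672', 'JO_272304', 'TI_2969041', 'TA_954638'],
    ['JO_2027410', 'TI_2509936', 'TA_4321623'],
]
word_shuffled = shuffle_words(sentences)
```

Calling this once per epoch would give the model a different token order on each pass. Whether that actually changes the similarity scores between 'JO_', 'TI_' and 'TA_' tokens is not something this sketch establishes.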

2016-05-09 08:34:03 ssh26

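Word2Vec is designed to determine the similarity between words based on word order, or "context". What you are looking for may be a bag-of-words approach. – Swier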
