2016-05-08

How to shuffle the words in word2vec

I have this piece of code:

import gensim 
import random 


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt') 

read_data = file.read() 

data = read_data.split('\n') 

sentences = [line.split() for line in data] 
print(len(sentences)) 
print(sentences[1]) 

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5) 
model.build_vocab(sentences) 

for epoch in range(5): 
    shuffled_sentences = random.shuffle(sentences) 
    model.train(shuffled_sentences) 
    print(epoch) 
    print(model) 

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model') 

If I print a single sentence, the output looks like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840'] 

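What I need is to shuffle the words before training and then save the model.

I'm not sure whether I have coded this the right way. I end up with this exception: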

Exception in thread Thread-8: 
Traceback (most recent call last): 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner 
    self.run() 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run 
    self._target(*self._args, **self._kwargs) 
    File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer 
    for sent_idx, sentence in enumerate(sentences): 
    File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__ 
    for document in self.corpus: 
TypeError: 'NoneType' object is not iterable 

How can I shuffle the words?

Did this solve your problem? – PKuhn

Answers

0

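random.shuffle shuffles the list in place and returns None. For this reason, after this call your shuffled_sentences is None, which is why model.train raises "TypeError: 'NoneType' object is not iterable".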
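A minimal sketch of that behaviour, using made-up tokens rather than the real dataset. The variable names `result` and `shuffled_copy` are only for illustration; `random.sample` is shown as one possible way to get a shuffled copy when the original order should be kept.

```python
import random

# Toy stand-in for the tokenised sentences (hypothetical data)
sentences = [["JO_1", "TI_2"], ["JO_3", "TA_4"], ["TI_5", "TA_6"]]

# random.shuffle reorders the list in place and returns None,
# so assigning its return value gives you None, not a shuffled list.
result = random.shuffle(sentences)
print(result)          # None
print(len(sentences))  # 3 (same items, possibly in a new order)

# If a shuffled copy is wanted instead, random.sample returns a new list
# and leaves the original untouched.
shuffled_copy = random.sample(sentences, len(sentences))
```

In the training loop this means calling random.shuffle(sentences) on its own line and then passing sentences itself to model.train.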

Thanks for your reply, please see my other post. – ssh26

0
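model.build_vocab(sentences) 
sentences_list = sentences  # alias: the same list object as `sentences` 
Idx = range(len(sentences_list)) 
print(Idx) 
for epoch in range(5): 
    random.shuffle(sentences)  # in place, so sentences_list is reordered too 
    perm_sentences = [sentences_list[i] for i in Idx]  # copy in the new order 
    model.train(perm_sentences) 
    print(epoch) 
    print(model) 
model.save("somefile.model")  # save once, after the last epoch 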

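This solved my problem.

But how do I shuffle the words within a sentence?

Sentence: ['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

My goal: if I check the most similar words for, say, 'JO_3787672', the predictions are always words starting with 'JO_'. Words starting with 'TA_' and 'TI_' get very small similarity scores. I suspect this is because of the word positions in the data (I'm not sure). That is why I am trying to shuffle the words within each sentence (I'm really not sure whether it will help).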
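One possible way to shuffle the words inside each sentence, as a sketch only. The helper name `shuffle_words` and its `seed` parameter are hypothetical, and the token lists are shortened examples. It builds new lists with `random.sample`, so the original sentences stay unchanged.

```python
import random

def shuffle_words(sentences, seed=None):
    """Return new sentences whose tokens are in random order (hypothetical helper)."""
    rng = random.Random(seed)
    # rng.sample returns a new shuffled list; each input sentence is left as-is
    return [rng.sample(sentence, len(sentence)) for sentence in sentences]

# Shortened example sentences in the same JO_/TI_/TA_ format
sentences = [
    ['JO_3787672', 'JO_272304', 'TI_2969041', 'TA_954638'],
    ['JO_2027410', 'TI_2509936', 'TA_4321623'],
]

shuffled = shuffle_words(sentences, seed=0)
print(shuffled[0])
```

Calling shuffle_words(sentences) once per epoch and passing the result to model.train in place of perm_sentences would give every epoch a different word order. Whether that actually raises the 'TA_' and 'TI_' similarity scores is something to check empirically.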

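Word2Vec is designed to determine the similarity between words based on word order, or "context". What you are looking for may be a bag-of-words approach. – Swier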