2017-08-02

Gensim: word vector encoding problem

After creating word vectors with Gensim 2.2.0 from plain English text files with IMDB movie ratings:

import gensim, logging 
import smart_open, os 
from nltk.tokenize import RegexpTokenizer 

VEC_SIZE = 300 
MIN_COUNT = 5 
WORKERS = 4 
data_path = './data/' 
vectors_path = 'vectors.bin.gz' 

class AllSentences(object): 
    def __init__(self, dirname):
        self.dirname = dirname
        self.read_err_cnt = 0
        self.tokenizer = RegexpTokenizer('[\'a-zA-Z]+', discard_empty=True)

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            print(fname)
            for line in open(os.path.join(self.dirname, fname)):
                words = []
                try:
                    for word in self.tokenizer.tokenize(line):
                        words.append(word)
                    yield words
                except:
                    self.read_err_cnt += 1

sentences = AllSentences(data_path) 

Training and saving the model:

model = gensim.models.Word2Vec(sentences, size=VEC_SIZE,
                               min_count=MIN_COUNT, workers=WORKERS)
word_vectors = model.wv 
word_vectors.save(vectors_path) 

Then I try to load it back:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(vectors_path,
                                            binary=True,
                                            unicode_errors='ignore')

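I get the exception "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0" (see below). I tried different values of the 'encoding' parameter, including 'ISO-8859-1' and 'Latin1', combined with both binary=True and binary=False. Nothing helps: the same exception is raised no matter which parameters I use. What is wrong? How can I make loading the vectors work?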

Exception:

UnicodeDecodeError      Traceback (most recent call last) 
<ipython-input-64-f353fa49685c> in <module>() 
----> 1 w2v = get_w2v_vectors() 

<ipython-input-63-cbbe0a76e837> in get_w2v_vectors() 
     3  vectors = KeyedVectors.load_word2vec_format(word_vectors_path, 
     4              binary=True, 
----> 5              unicode_errors='ignore') 
     6 
     7             #unicode_errors='ignore') 

D:\usr\anaconda\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype) 
    204   logger.info("loading projection weights from %s", fname) 
    205   with utils.smart_open(fname) as fin: 
--> 206    header = utils.to_unicode(fin.readline(), encoding=encoding) 
    207    vocab_size, vector_size = map(int, header.split()) # throws for invalid file format 
    208    if limit: 

D:\usr\anaconda\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors) 
    233  if isinstance(text, unicode): 
    234   return text 
--> 235  return unicode(text, encoding, errors=errors) 
    236 to_unicode = any2unicode 
    237 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte 

Answers

If you save the vectors with gensim's native save() method, you should load them with the native load() method.
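To see why the mismatch produces this particular error, here is a minimal sketch that uses only the standard library. It relies on two facts: gensim's native save() is pickle-based, and a pickle written with protocol 2 or later starts with the byte 0x80. The dictionary fake_vectors and the helper read_word2vec_header are hypothetical stand-ins; the helper only mimics the first step visible in the traceback (decode the first line as UTF-8, then parse two integers) and is not gensim's actual code.

```python
import pickle

# Hypothetical stand-in for the vectors object written by the native save().
fake_vectors = {"movie": [0.1, 0.2, 0.3]}
pickled = pickle.dumps(fake_vectors, protocol=2)

# Pickle protocol 2 and later begin with the PROTO opcode, byte 0x80.
print(hex(pickled[0]))  # 0x80


def read_word2vec_header(data):
    """Mimic the first step of load_word2vec_format(): decode the first
    line as UTF-8 and parse it as '<vocab_size> <vector_size>'."""
    first_line = data.split(b"\n", 1)[0]
    vocab_size, vector_size = map(int, first_line.decode("utf-8").split())
    return vocab_size, vector_size


# A file in word2vec format starts with a plain-text header line.
word2vec_bytes = b"2 3\nmovie 0.1 0.2 0.3\nplot 0.4 0.5 0.6\n"
print(read_word2vec_header(word2vec_bytes))  # (2, 3)

# Pickled data fails at the very first byte, as in the question.
try:
    read_word2vec_header(pickled)
except UnicodeDecodeError as err:
    print(err)
```

This would also explain why changing binary or unicode_errors makes no difference: the header line is decoded before either setting comes into play, and the file simply is not in word2vec format.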

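If you want to load the vectors with load_word2vec_format(), you need to save them with save_word2vec_format(). (You will lose some information this way, such as the exact occurrence counts in the KeyedVectors.vocab dictionary entries.)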