
I am trying to implement Word2Vec CBOW with negative sampling in Keras, based on the code found here, which builds the Word2Vec model by merging embedding layers with a dot product using the Keras functional API:

# Imports for the snippet below (Keras 1.x functional API);
# SentencesIterator and VocabGenerator come from the code linked above.
import numpy as np
from keras.layers import Input, Embedding, Lambda, merge
from keras.models import Model
from keras import backend as K

EMBEDDING_DIM = 100

sentences = SentencesIterator('test_file.txt') 
v_gen = VocabGenerator(sentences=sentences, min_count=5, window_size=3, 
         sample_threshold=-1, negative=5) 

v_gen.scan_vocab() 
v_gen.filter_vocabulary() 
reverse_vocab = v_gen.generate_inverse_vocabulary_lookup('test_lookup') 

# Generate embedding matrix with all values between -1/2d, 1/2d 
embedding = np.random.uniform(-1.0/(2 * EMBEDDING_DIM), 
           1.0/(2 * EMBEDDING_DIM), 
           (v_gen.vocab_size + 3, EMBEDDING_DIM)) 

# Creating CBOW model 
# Model has 3 inputs 
# Current word index, context words indexes and negative sampled word indexes 
word_index = Input(shape=(1,)) 
context = Input(shape=(2*v_gen.window_size,)) 
negative_samples = Input(shape=(v_gen.negative,)) 

# All inputs are processed through a common embedding layer 
shared_embedding_layer = (Embedding(input_dim=(v_gen.vocab_size + 3), 
            output_dim=EMBEDDING_DIM, 
            weights=[embedding])) 

word_embedding = shared_embedding_layer(word_index) 
context_embeddings = shared_embedding_layer(context) 
negative_words_embedding = shared_embedding_layer(negative_samples) 

# Now the context words are averaged to get the CBOW vector 
cbow = Lambda(lambda x: K.mean(x, axis=1), 
       output_shape=(EMBEDDING_DIM,))(context_embeddings) 

# Context is multiplied (dot product) with current word and negative 
# sampled words 
word_context_product = merge([word_embedding, cbow], mode='dot') 
negative_context_product = merge([negative_words_embedding, cbow], 
           mode='dot', 
           concat_axis=-1) 

# The dot products are outputted 
model = Model(input=[word_index, context, negative_samples], 
       output=[word_context_product, negative_context_product]) 

# Binary crossentropy is applied on the output 
model.compile(optimizer='rmsprop', loss='binary_crossentropy') 
print(model.summary()) 

model.fit_generator(v_gen.pretraining_batch_generator(reverse_vocab), 
        samples_per_epoch=10, 
        nb_epoch=1) 

However, I get an error at the merge layers, because the embedding layer outputs a 3D tensor while cbow is only 2-dimensional. I assume I need to reshape the embedding (which is [?, 1, 100]) to [1, 100], but I cannot find how to do that reshape with the functional API. I am using the TensorFlow backend. Also, if anyone can point to another implementation of CBOW with Keras (Gensim-free), I would love to have a look at it!

Thanks!

EDIT: here is the error

Traceback (most recent call last): 
    File "cbow.py", line 48, in <module> 
    word_context_product = merge([word_embedding, cbow], mode='dot') 
    . 
    . 
    . 
ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,100], [?,100]. 
Could you show the error?

Sure, sorry. Totally forgot!

Answer

ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,100], [?,100]. 

You do indeed need to reshape the word_embedding tensor. There are two ways to do this:

  • Either you use the Reshape() layer, imported from keras.layers.core, like this:

    word_embedding = Reshape((100,))(word_embedding) 
    

    The argument of Reshape is a tuple giving the target shape.

  • Or you can use the Flatten() layer, also imported from keras.layers.core, used like this:

    word_embedding = Flatten()(word_embedding) 
    

    Flatten takes no argument; it simply removes the "empty" dimensions. (A short sketch of both options follows below.)
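A minimal sketch of both options (assuming Keras 1.x with the TensorFlow backend, as in the question; the vocabulary size of 1000 is illustrative):

from keras.layers import Input, Embedding
from keras.layers.core import Reshape, Flatten

EMBEDDING_DIM = 100

word_index = Input(shape=(1,))
# Embedding outputs a 3D tensor: (None, 1, EMBEDDING_DIM)
word_embedding = Embedding(input_dim=1000, output_dim=EMBEDDING_DIM)(word_index)

# Option 1: Reshape, with the target shape as a tuple -> (None, EMBEDDING_DIM)
reshaped = Reshape((EMBEDDING_DIM,))(word_embedding)

# Option 2: Flatten, no argument; drops the "empty" axis -> (None, EMBEDDING_DIM)
flattened = Flatten()(word_embedding)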

Does this help?

EDIT:

The second merge() is indeed a bit trickier. The dot merge in Keras only accepts tensors of the same rank, i.e. the same len(shape). So what you want to do is use a Reshape() layer to add one empty dimension to cbow, and then use the dot_axes argument instead of concat_axis, which is irrelevant to a dot merge. This is the solution I suggest:

word_embedding = shared_embedding_layer(word_index) 
# Shape output = (None,1,emb_size) 
context_embeddings = shared_embedding_layer(context) 
# Shape output = (None, 2*window_size, emb_size) 
negative_words_embedding = shared_embedding_layer(negative_samples) 
# Shape output = (None, negative, emb_size) 

# Now the context words are averaged to get the CBOW vector 
cbow = Lambda(lambda x: K.mean(x, axis=1), 
        output_shape=(EMBEDDING_DIM,))(context_embeddings) 
# Shape output = (None, emb_size) 
cbow = Reshape((1, EMBEDDING_DIM))(cbow) 
# Shape output = (None, 1, emb_size) 

# Context is multiplied (dot product) with current word and negative 
# sampled words 
word_context_product = merge([word_embedding, cbow], mode='dot') 
# Shape output = (None, 1, 1) 
word_context_product = Flatten()(word_context_product) 
# Shape output = (None,1) 
negative_context_product = merge([negative_words_embedding, cbow], mode='dot',dot_axes=[2,2]) 
# Shape output = (None, negative, 1) 
negative_context_product = Flatten()(negative_context_product) 
# Shape output = (None, negative) 
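With these output shapes, the targets yielded by the batch generator should be ones for the positive dot product and zeros for the sampled negatives, since binary crossentropy is applied to each output. A hypothetical sketch (batch_size and the array names are illustrative; the actual generator is in the code linked from the question):

import numpy as np

batch_size = 32  # illustrative
# Matches word_context_product, shape (None, 1)
positive_targets = np.ones((batch_size, 1))
# Matches negative_context_product, shape (None, negative)
negative_targets = np.zeros((batch_size, v_gen.negative))
# model.train_on_batch([word_batch, context_batch, negative_batch],
#                      [positive_targets, negative_targets])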

Does it work? :)

The problem comes from TensorFlow's rigidity regarding matrix multiplication. Merging with the 'dot' mode calls the backend batch_dot() function and, contrary to Theano, TensorFlow requires the matrices to have the same rank: read here
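To see the backend behaviour directly, here is a minimal sketch of batch_dot on two rank-3 tensors (assuming Keras 1.x with the TensorFlow backend; the shapes mirror the ones above):

import numpy as np
from keras import backend as K

a = K.variable(np.random.rand(4, 1, 100))  # rank 3: (batch, 1, emb_size)
b = K.variable(np.random.rand(4, 1, 100))  # rank 3: same rank, so TF accepts it
out = K.batch_dot(a, b, axes=[2, 2])       # dot over the embedding axis
print(K.eval(out).shape)                   # (4, 1, 1)

# A rank-2 second argument, e.g. shape (4, 100), triggers the
# "Shape must be rank 2 but is rank 3" MatMul error under TensorFlow,
# whereas Theano accepts tensors of mixed rank.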

It did help for the first merge, yes, thank you very much! But I get an error on the second one, because 'negative_samples' has shape (5,) and not (1,): 'ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,5,100], [?,100].' What I don't get is that this code works fine with Theano, but not with TensorFlow..

Hence the edit :)

It works, thank you very much! And thanks for the clarification, I know nothing about Theano!