I'm trying to implement Word2Vec CBOW with negative sampling in Keras, following the code found here: Product merge layer with Keras functional API for Word2Vec model
# Imports assumed for the Keras 1.x API used below
import numpy as np
from keras import backend as K
from keras.layers import Input, Embedding, Lambda, merge
from keras.models import Model
# SentencesIterator and VocabGenerator are assumed to be my own helper classes (not shown)
EMBEDDING_DIM = 100
sentences = SentencesIterator('test_file.txt')
v_gen = VocabGenerator(sentences=sentences, min_count=5, window_size=3,
                       sample_threshold=-1, negative=5)
v_gen.scan_vocab()
v_gen.filter_vocabulary()
reverse_vocab = v_gen.generate_inverse_vocabulary_lookup('test_lookup')
# Generate embedding matrix with all values between -1/2d, 1/2d
embedding = np.random.uniform(-1.0/(2 * EMBEDDING_DIM),
                              1.0/(2 * EMBEDDING_DIM),
                              (v_gen.vocab_size + 3, EMBEDDING_DIM))
# Creating CBOW model
# Model has 3 inputs
# Current word index, context words indexes and negative sampled word indexes
word_index = Input(shape=(1,))
context = Input(shape=(2*v_gen.window_size,))
negative_samples = Input(shape=(v_gen.negative,))
# All inputs are processed through a common embedding layer
shared_embedding_layer = (Embedding(input_dim=(v_gen.vocab_size + 3),
                                    output_dim=EMBEDDING_DIM,
                                    weights=[embedding]))
word_embedding = shared_embedding_layer(word_index)
context_embeddings = shared_embedding_layer(context)
negative_words_embedding = shared_embedding_layer(negative_samples)
# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1),
              output_shape=(EMBEDDING_DIM,))(context_embeddings)
# Context is multiplied (dot product) with current word and negative
# sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
negative_context_product = merge([negative_words_embedding, cbow],
                                 mode='dot',
                                 concat_axis=-1)
# The dot products are outputted
model = Model(input=[word_index, context, negative_samples],
              output=[word_context_product, negative_context_product])
# Binary crossentropy is applied on the output
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
print(model.summary())
model.fit_generator(v_gen.pretraining_batch_generator(reverse_vocab),
                    samples_per_epoch=10,
                    nb_epoch=1)
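For reference, the "binary crossentropy on the dot products" idea in the listing above can be sketched in plain NumPy. In the word2vec negative-sampling objective, each dot product is passed through a sigmoid, the true word is labelled 1 and each negative sample 0. The function name `negative_sampling_loss`, the `eps` clipping value, and the label layout are assumptions made for illustration; they are not taken from the model code or from what the batch generator yields.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(positive_score, negative_scores, eps=1e-7):
    """Mean binary cross-entropy over one positive and k negative scores.

    positive_score:  array of shape (batch, 1), dot(word, cbow)
    negative_scores: array of shape (batch, k), dot(negative_i, cbow)
    (hypothetical helper, not part of Keras)
    """
    # Column 0 holds the true word, the remaining columns the negatives.
    scores = np.concatenate([positive_score, negative_scores], axis=1)
    labels = np.zeros_like(scores)
    labels[:, 0] = 1.0
    # Squash raw dot products into (0, 1) and clip to keep log() finite.
    probs = np.clip(sigmoid(scores), eps, 1.0 - eps)
    bce = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return bce.mean()
```

The sketch assumes the scores are raw dot products, which is why the sigmoid is applied inside the loss; a model whose outputs are not already probabilities would need an equivalent step before binary cross-entropy is meaningful.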
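The merge step, however, fails with an error because the embedding layer's output is a 3D tensor while cbow has only 2 dimensions. I assume I need to reshape the embedding (which is [?, 1, 100]) to [1, 100], but I can't find how to reshape with the functional API. I'm using the Tensorflow backend. Also, if anyone can point me to another implementation of CBOW with Keras (Gensim-free), I'd love to take a look!
Thanks!
Edit: here is the error: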
Traceback (most recent call last):
  File "cbow.py", line 48, in <module>
    word_context_product = merge([word_embedding, cbow], mode='dot')
.
.
.
ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,100], [?,100].
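The shapes in this error, [?,1,100] and [?,100], can be reproduced and reconciled outside Keras. The NumPy sketch below only illustrates the shape arithmetic: the batch size of 4 and the random values are made up, and the variable names mirror the model code rather than any Keras API. In Keras itself, applying a `Reshape((EMBEDDING_DIM,))` or `Flatten()` layer to the [?, 1, 100] tensor should play the same role as the `reshape` call here.

```python
import numpy as np

# Illustrative sizes; BATCH is an assumption, the others match the post.
BATCH, NEGATIVE, EMBEDDING_DIM = 4, 5, 100
rng = np.random.default_rng(0)

word_embedding = rng.normal(size=(BATCH, 1, EMBEDDING_DIM))              # [?, 1, 100]
negative_embeddings = rng.normal(size=(BATCH, NEGATIVE, EMBEDDING_DIM))  # [?, 5, 100]
cbow = rng.normal(size=(BATCH, EMBEDDING_DIM))                           # [?, 100]

# Drop the length-1 axis so both operands are rank 2: [?, 100] and [?, 100].
word_vec = word_embedding.reshape(BATCH, EMBEDDING_DIM)

# Per-sample dot product between the current word and its context vector.
word_context_product = np.sum(word_vec * cbow, axis=1, keepdims=True)    # [?, 1]

# Per-sample dot product of every negative word with the same context vector.
negative_context_product = np.einsum('bnd,bd->bn',
                                     negative_embeddings, cbow)          # [?, 5]

print(word_context_product.shape, negative_context_product.shape)
```

The negative samples keep their middle axis because there are several of them per example, so only the single current-word embedding needs the extra axis removed.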
Can you show the error?
Sure, sorry. Totally forgot!