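Why is tf.mul used in the word2vec training process?

The word2vec model uses noise-contrastive estimation (NCE) loss to train the model.

Why does it use tf.mul in the true sample logit calculation, but tf.matmul in the negative sample calculation?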

2016-07-03 Capemer

FYI: 'tf.mul' has been renamed to 'tf.multiply'. See: https://www.tensorflow.org/api_docs/python/tf/multiply

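One way to think about the NCE loss calculation is as a batch of independent binary logistic regression classification problems. In both cases we perform the same computation, even though it may not look that way at first.

To show that we are in fact computing the same thing, assume the following for the true input part: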

emb_dim = 3 # dimensions of your embedding vector 
batch_size = 2 # number of examples in your training batch 
vocab_size = 6 # number of total words in your text 
       # (so your word ids range from 0 - 5)

Furthermore, assume the batch contains the following training examples:

1 => 0 # given word with word_id=1, I expect word with word_id=0 
1 => 2 # given word with word_id=1, I expect word with word_id=2

Then your embedding matrix example_emb has dimensions [2,3], your true weight matrix true_w also has dimensions [2,3], and they should look like this:

example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word 
true_w  = [ [w1,w2,w3], [w4,w5,w6] ] # [2,3] target word

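example_emb is a subset of the total word embeddings (emb) that you are trying to learn, and true_w is a subset of the weights (smb_w_t). Each row in example_emb represents an input vector, and each row in the weights represents a target vector. So [e1,e2,e3] is the word vector of the input word with word_id = 1, taken from emb, and [w1,w2,w3] is the word vector of the expected target word with word_id = 0.

Intuitively, the classification task you are trying to solve is: given that I see this input word together with this target word, is the observation correct?

The two classification tasks are then (without bias; TensorFlow has the handy 'sigmoid_cross_entropy_with_logits' function, which applies the sigmoid later):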

logit(1=>0) = dot([e1,e2,e3], transpose([w1,w2,w3])) => 
logit(1=>0) = e1*w1 + e2*w2 + e3*w3 

and 

logit(1=>2) = dot([e1,e2,e3], transpose([w4,w5,w6])) => 
logit(1=>2) = e1*w4 + e2*w5 + e3*w6

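The easiest way to compute [[logit(1=>0)], [logit(1=>2)]] is to perform an element-wise multiplication with tf.mul() and then sum up each row.

The output of this computation is a [batch_size, 1] matrix containing the logits for the correct words. We know the ground truth/label (y') for these examples: it is 1, because these are the correct examples.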

true_logits = [ 
    [logit(1=>0)], # first input word of the batch 
    [logit(1=>2)] # second input word of the batch 
]
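
As a rough illustration of the element-wise approach described above, here is a minimal NumPy sketch. It is not the TensorFlow code itself: NumPy's `*` and `np.sum` stand in for tf.mul() and the row sum, the numeric values are made up, and `keepdims=True` is only used to get the [batch_size, 1] shape mentioned above.

```python
import numpy as np

def true_logits(example_emb, true_w):
    """Row-wise dot product: multiply element-wise, then sum each row."""
    # example_emb and true_w both have shape [batch_size, emb_dim].
    return np.sum(example_emb * true_w, axis=1, keepdims=True)

# Made-up numbers standing in for [e1,e2,e3] and [w1..w6].
example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])
true_w = np.array([[0.5, -1.0, 2.0],
                   [1.0, 0.0, -1.0]])

logits = true_logits(example_emb, true_w)
print(logits.shape)  # (2, 1)
```

Each row of the result pairs one input vector with exactly one target vector, which is why no matrix product is needed here.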

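Now for the second part of your question, why we use tf.matmul() for the negative samples: assume we draw 3 negative samples (num_sampled = 3), so sampled_ids = [3,4,5].

Intuitively, this means that you add six more training examples to your batch, namely:

1 => 3 # given word_id=1, do I expect word_id=3? No, because these are negative examples. 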
1 => 4 
1 => 5 
1 => 3 # second input word is also word_id=1 
1 => 4 
1 => 5

So you look up your sampled_w, which turns out to be a [3,3] matrix. Your parameters now look like this:

example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word 
sampled_w = [ [w6,w7,w8], [w9,w10,w11], [w12,w13,w14] ] # [3,3] sampled target words

As in the true case, what we need are the logits for all negative training examples. For instance, for the first example:

logit(1 => 3) = dot([e1,e2,e3], transpose([w6,w7,w8])) => 
logit(1 => 3) = e1*w6 + e2*w7 + e3*w8

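In this case we can use matrix multiplication, once the sampled_w matrix has been transposed. This is achieved with the transpose_b = True argument in the tf.matmul() call. The transposed weight matrix looks like this: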

sampled_w_trans = [ [w6,w9,w12], [w7,w10,w13], [w8,w11,w14] ] # [3,3]

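So the tf.matmul() operation returns a [batch_size, 3] matrix, where each row holds the logits for one example of the input batch. Each element is the logit of one classification task.

The whole result matrix for the negative samples contains this: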

sampled_logits = [ 
    [logit(1=>3), logit(1=>4), logit(1=>5)], # first input word of the batch 
    [logit(1=>3), logit(1=>4), logit(1=>5)] # second input word of the batch 
]
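
The matrix product for the negative samples can be sketched the same way. This is a NumPy stand-in under assumed values, where `@` with `.T` plays the role of tf.matmul(..., transpose_b=True); the function name `sampled_logits` and all numbers are illustrative only.

```python
import numpy as np

def sampled_logits(example_emb, sampled_w):
    """Logits of every input vector against every sampled target vector."""
    # [batch_size, emb_dim] @ [emb_dim, num_sampled] -> [batch_size, num_sampled]
    return example_emb @ sampled_w.T

# Made-up numbers standing in for [e1,e2,e3] and [w6..w14].
example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])
sampled_w = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [1.0, 1.0, 1.0]])

neg_logits = sampled_logits(example_emb, sampled_w)
print(neg_logits.shape)  # (2, 3)
```

Because every input word is paired with all three sampled words, the full product of the two matrices is exactly the set of logits that is needed.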

The labels/ground truth for sampled_logits are all zeros, because these are all negative examples.
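
To make the role of the labels concrete, the following sketch applies a sigmoid cross-entropy to both sets of logits, with labels of 1 for the true logits and 0 for the sampled ones. It is a NumPy approximation written for illustration: the helper names `sigmoid_xent` and `nce_loss` are hypothetical, and summing both terms and dividing by the batch size is an assumption about how the pieces are combined, not something stated in this answer.

```python
import numpy as np

def sigmoid_xent(logits, labels):
    """Numerically stable sigmoid cross-entropy, computed per element."""
    # Equivalent to -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)).
    return (np.maximum(logits, 0.0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def nce_loss(true_logits, sampled_logits):
    """Assumed combination: sum of both cross-entropies per batch example."""
    true_xent = sigmoid_xent(true_logits, np.ones_like(true_logits))
    sampled_xent = sigmoid_xent(sampled_logits, np.zeros_like(sampled_logits))
    batch_size = true_logits.shape[0]
    return (true_xent.sum() + sampled_xent.sum()) / batch_size

# All-zero logits mean the model is maximally unsure about every pair.
loss = nce_loss(np.zeros((2, 1)), np.zeros((2, 3)))
print(loss)
```

With all logits at zero, each of the eight binary tasks contributes log(2), so the loss per batch example is 4 * log(2).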

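In both cases we perform the same computation, namely computing the logits of a binary logistic regression classifier (without the sigmoid, which is applied later).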
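
One way to check that the two forms really are the same computation is to compare them numerically. The sketch below uses NumPy with random made-up values: the element-wise product followed by a row sum should match the diagonal of the full matrix product, and the off-diagonal entries are pairings of one example's input with another example's target, which the true case does not need.

```python
import numpy as np

rng = np.random.default_rng(0)
example_emb = rng.normal(size=(2, 3))  # [batch_size, emb_dim]
true_w = rng.normal(size=(2, 3))       # [batch_size, emb_dim]

# Element-wise multiply and row sum: one logit per (input, target) pair.
elementwise = np.sum(example_emb * true_w, axis=1)

# Full matrix product: logits for all batch_size x batch_size pairings.
all_pairs = example_emb @ true_w.T

# Only the diagonal pairs row i of the inputs with row i of the targets.
same = np.allclose(elementwise, np.diag(all_pairs))
print(same)  # True
```

So using the element-wise form for the true samples simply avoids computing pairings that would be thrown away.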

2016-08-10 15:39:26 bruThaler
