Interpreting negative Word2Vec similarity from gensim
E.g. we train a word2vec model using gensim:
from gensim import corpora, models, similarities
from gensim.models.word2vec import Word2Vec
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
texts = [[word for word in document.lower().split()] for document in documents]
# Train a small model (in gensim >= 4.0 the `size` parameter is named `vector_size`)
w2v_model = Word2Vec(texts, size=500, window=5, min_count=1)
When we query the similarity between words, we find some negative similarity scores:
>>> w2v_model.similarity('graph', 'computer')
0.046929569156789336
>>> w2v_model.similarity('graph', 'system')
0.063683518562347399
>>> w2v_model.similarity('survey', 'generation')
-0.040026775040430063
>>> w2v_model.similarity('graph', 'trees')
-0.0072684112978664561
How do we interpret the negative scores?
If it is a cosine similarity, shouldn't the range be [0,1]?
What are the upper and lower bounds of the Word2Vec.similarity(x,y) function? There isn't much written in the docs: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similarity =(
Looking at the Python wrapper code, there isn't much there either: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1165
(If possible, please point me to the .pyx code where the similarity function is implemented.)
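If it uses cosine similarity, then the range is [-1, 1]. From the Wikipedia article: "It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude."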
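To make the three cases from the Wikipedia quote concrete, here is a minimal pure-Python sketch. The `cosine_similarity` helper and the toy 2-D vectors are made up for illustration; this is not gensim's implementation, only the textbook formula.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (textbook formula)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different magnitude: close to 1.
same_direction = cosine_similarity([1.0, 2.0], [3.0, 6.0])

# Vectors at 90 degrees: 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])

# Opposite directions, different magnitude: close to -1.
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])

print(same_direction, orthogonal, opposite)
```

Scaling either vector by a positive factor leaves the result unchanged, which is the "independent of their magnitude" part of the quote.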
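Cosine similarity can be interpreted as the dot product of the two vectors after each is normalized to unit length. So if two words have a cosine similarity of 0, their vectors are perfectly orthogonal, which means they have two different "meanings" and are completely unrelated. A negative similarity means the two words are related in their components, but in an opposite (or negative) way.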
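A small sketch of the dot-product view, which may also explain where the [0,1] expectation comes from: when every component is non-negative (as with term counts), the normalized dot product cannot go below 0, whereas dense embedding vectors have signed components and so can give negative values. The helpers `unit` and `dot` and all the numbers below are hypothetical, chosen only for illustration.

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical term-count vectors: every component is >= 0,
# so the similarity is confined to [0, 1].
counts_a = [2, 0, 1]
counts_b = [0, 3, 1]
sim_counts = dot(unit(counts_a), unit(counts_b))

# Hypothetical dense embedding vectors: components may be negative,
# so the similarity can fall anywhere in [-1, 1].
embed_a = [0.4, -0.2, 0.1]
embed_b = [-0.3, 0.5, 0.1]
sim_embed = dot(unit(embed_a), unit(embed_b))

print(sim_counts, sim_embed)
```

Here `sim_counts` is positive and `sim_embed` is negative, even though both are computed with the same formula.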