Merging two CountVectorizers and computing cosine similarity


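I am trying to implement a technique described in an information retrieval paper, in which documents are converted to vectors and their cosine similarity is then computed, as explained here: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In that example, we have: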

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 

documents = (
    "The sky is blue", 
    "The sun is bright", 
    "The sun in the sky is bright", 
    "We can see the shining sun, the bright sun" 
) 

tfidf_vectorizer = TfidfVectorizer() 
tfidf_matrix = tfidf_vectorizer.fit_transform(documents) 
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

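However, every so often I receive a new document. Is there a way to compute the cosine similarity for this new document without rebuilding the documents tuple and tfidf_matrix?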


2017-07-24 Arthur Marques

Yes, you can do it like this:

new_docs = [ 
    "This is new doc 1", 
    "This is new doc 2", 
] 
# TfidfVectorizer has no predict(); transform() reuses the fitted vocabulary and IDF weights
new_tfidf_matrix = tfidf_vectorizer.transform(new_docs)
cosine_similarity(new_tfidf_matrix, tfidf_matrix)

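If you expect the new documents to contain vocabulary that does not appear in the training data, consider refitting the vectorizer with tfidf_vectorizer.fit(all_docs), because transform ignores any word it did not see during fitting.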
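To illustrate the two options above, here is a minimal sketch, assuming scikit-learn's default tokenizer settings. The contents of new_docs and the names refit_vectorizer and all_matrix are made up for illustration and are not part of the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
]
# Hypothetical new documents: the second one contains only unseen words.
new_docs = ["The sky is bright", "A purple moon"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Option 1: reuse the fitted vocabulary and IDF weights.
# Words that were not seen during fit() are silently dropped.
new_matrix = vectorizer.transform(new_docs)
similarities = cosine_similarity(new_matrix, tfidf_matrix)  # one row per new document

# Option 2: refit on everything so the new words enter the vocabulary.
# This changes the IDF weights, so the old tfidf_matrix must be recomputed too.
all_docs = documents + new_docs
refit_vectorizer = TfidfVectorizer()
all_matrix = refit_vectorizer.fit_transform(all_docs)

print(similarities.shape)
print("purple" in vectorizer.vocabulary_, "purple" in refit_vectorizer.vocabulary_)
```

With option 1, a document made up entirely of unseen words becomes an all-zero vector, so its similarity to every existing document is 0. Whether that is acceptable depends on how much new vocabulary you expect to arrive.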


2017-07-24 10:40:39 elyase
