SciPy, tf-idf and cosine similarity

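I am trying to cluster some documents in Python based on their tf-idf matrix.

First, I follow the Wikipedia definition of the formula, using the normalized tf. http://en.wikipedia.org/wiki/Tf-idf

feat_vectors starts out as a 2D numpy array in which rows represent documents, columns represent terms, and each cell holds the number of occurrences of that term in that document.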

import numpy as np 

feat_vectors /= np.max(feat_vectors,axis=1)[:,np.newaxis] 
idf = len(feat_vectors)/(feat_vectors != 0).sum(0) 
idf = np.log(idf) 
feat_vectors *= idf

I then cluster these vectors with SciPy:

from scipy.cluster import hierarchy 

clusters = hierarchy.linkage(feat_vectors,method='complete',metric='cosine') 
flat_clusters = hierarchy.fcluster(clusters, 0.8,'inconsistent')

However, the last line throws an error:

ValueError: Linkage 'Z' contains negative distances.

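Cosine similarity ranges from -1 to 1. However, the Wikipedia page on cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) states:

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative.

So if I am getting a negative similarity, it seems I am making a mistake when computing tf-idf. Any ideas what my mistake is?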

Source

2012-12-03 Fergusmac

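It looks like your 'feat_vectors' has negative values. Either they are there before the multiplication by 'idf', or 'idf' has values smaller than 1 before you take 'np.log'. – tiago

The minimum value in the matrix is zero. It is only the result of the cosine similarity that is < 0. – Fergusmac

I suspect the error is in the following line: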

idf = len(feat_vectors)/(feat_vectors != 0).sum(0)

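Your boolean array is converted to int by the sum, and len returns an int, so the division is an integer division (under Python 2) and you lose precision. Replace it with: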

idf = float(len(feat_vectors))/(feat_vectors != 0).sum(0)

This worked for me (i.e. it produced what I expected on dummy data). Everything else looks correct.
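For reference, here is a minimal sketch of the same tf-idf computation with the float conversion made explicit. The function name `tfidf` and the toy `counts` array are my own illustration, not from the question, and the sketch assumes every document contains at least one term and every term occurs in at least one document (otherwise a division by zero occurs).

```python
import numpy as np

def tfidf(counts):
    """Max-normalized tf times log idf, computed entirely in floating point."""
    # Convert up front so no step falls back to integer arithmetic.
    counts = np.asarray(counts, dtype=float)
    # tf: divide each row by that document's largest term count.
    tf = counts / counts.max(axis=1, keepdims=True)
    # df: number of documents in which each term occurs.
    df = (counts != 0).sum(axis=0)
    # idf: log of (number of documents / document frequency), float division.
    idf = np.log(float(counts.shape[0]) / df)
    return tf * idf

# Hypothetical toy data: 3 documents (rows) by 3 terms (columns).
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [0, 3, 1]])
weights = tfidf(counts)
```

Since tf is non-negative and idf is the log of a ratio that is at least 1, none of the resulting weights can be negative.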

Source

2012-12-05 14:56:56

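I know this is an old post, but I stumbled on this problem myself recently. In fact, once my own function had raised this error, I even used TfidfVectorizer (from sklearn.feature_extraction.text) to generate the TF-IDF matrix. That did not help either.

It seems that using the cosine metric for similarity is what leads to the negative values. I tried Euclidean and it worked right away. Here is a link to a more detailed answer I found on the same issue - https://stackoverflow.com/a/2590194/3228300

Hope this helps.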
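If you want to keep the cosine metric, one possible workaround is to compute the pairwise distances yourself and clamp them at zero before building the linkage. This is only a sketch under an assumption that nobody in this thread has confirmed: that the negative values are tiny floating-point rounding errors in 1 - cosine similarity. The random `feat_vectors` below is hypothetical stand-in data with no all-zero rows.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical stand-in for a non-negative tf-idf matrix: 6 documents, 4 terms.
rng = np.random.default_rng(0)
feat_vectors = rng.random((6, 4))

# Condensed vector of pairwise cosine distances (1 - cosine similarity).
dists = pdist(feat_vectors, metric='cosine')
# Assumption: any value below zero is rounding noise, so clamp it to zero.
dists = np.clip(dists, 0.0, None)

# linkage accepts a condensed distance vector in place of the raw observations.
clusters = hierarchy.linkage(dists, method='complete')
flat_clusters = hierarchy.fcluster(clusters, 0.8, 'inconsistent')
```

If the negative distances are large rather than tiny, clamping would only hide the problem, and the input matrix itself should be checked for negative or NaN entries.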

Source

2015-09-25 02:47:11 vsdaking
