Normalizing cosine similarity values computed from TF-IDF
I compute cosine similarity from a TF-IDF matrix:
tfidf_vectorizer_desc = TfidfVectorizer(min_df=5, max_df=0.8, use_idf=True, smooth_idf=True, sublinear_tf=False, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions) #fit the vectorizer to text
sim_desc = cosine_similarity(tfidf_matrix_desc)
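However, sim_desc contains similarities greater than 1.0 (see below). As far as I know, cosine_similarity returns values between 0 and 1. In this case, do I need to normalize the cosine similarity scores?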
sim_desc = cosine_similarity(tfidf_matrix_desc)
print(np.where(sim_desc < 0))
print(np.where(sim_desc > 1))
print(format(np.amax(sim_desc), '.20g'),format(np.amin(sim_desc), '.20g'))
(array([], dtype=int64), array([], dtype=int64))
(array([ 0, 0, 0, ..., 1496, 1496, 1497]), array([ 0, 1, 735, ..., 1495, 1496, 1497]))
1.0000000000000006661 0
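In positive space (non-negative feature values), cosine similarity is between 0 and 1... – kitchenprinzessin
Agreed, given the feature vectors you are using and that assumption. – unaki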
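The maximum printed in the question, 1.0000000000000006661, exceeds 1 by roughly 6.7e-16, which is a few units of double-precision machine epsilon (about 2.2e-16). That suggests floating-point rounding rather than a scaling problem, in which case clamping the values should be enough and no re-normalization is needed. Below is a minimal sketch under that assumption. The helper name clip_similarity and its tol parameter are made up for illustration, and a small random non-negative matrix stands in for the real TF-IDF matrix.

```python
import numpy as np


def clip_similarity(sim, tol=1e-9):
    """Clamp a cosine-similarity matrix to [0, 1].

    Hypothetical helper: only overshoots smaller than `tol` are treated
    as rounding error; anything larger raises instead of being hidden.
    """
    sim = np.asarray(sim, dtype=float)
    if sim.max() > 1.0 + tol or sim.min() < -tol:
        raise ValueError("values outside [0, 1] by more than tol; check the input vectors")
    return np.clip(sim, 0.0, 1.0)


# Stand-in for a non-negative TF-IDF matrix (rows = documents).
rng = np.random.default_rng(0)
X = rng.random((50, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalise each row

sim = X @ X.T                      # cosine similarity of unit-length rows
sim_clipped = clip_similarity(sim)

# Compare the largest value before and after clamping.
print(format(sim.max(), '.20g'), format(sim_clipped.max(), '.20g'))
```

With the matrix from the question this would be called as clip_similarity(sim_desc). The tolerance check is a design choice: if a value were off by much more than rounding error, it would point to a real problem in the input vectors, and silently clipping would mask it.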