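I was reading about Similarity Measure and suddenly my whole world collapsed. I have implemented a search engine using a clustering technique. For clustering I use K-means, with Euclidean distance as the distance measure. I also use cosine similarity to display results. I was getting amazingly accurate results. But now that I have read this, I realize that what I do is normalize the document vectors and compute the Euclidean distance between two vectors, so I am not taking magnitude into account anywhere. Euclidean distance or cosine similarity?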

What am I doing wrong?

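Although I think a higher term frequency leads to a higher tf-idf value and a higher normalized tf-idf value, so the document should still be ranked appropriately higher.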
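To make the role of magnitude concrete, here is a minimal sketch using made-up toy vectors (`centroid`, `doc`, and `long_doc` are hypothetical, not taken from the real index). It assumes L2 normalization and shows that scaling a document vector changes the raw Euclidean distance but leaves both the cosine similarity and the normalized Euclidean distance unchanged.

```python
import math

def norm(v):
    """L2 length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit L2 length."""
    n = norm(v)
    return [x / n for x in v]

def euclidean(a, b):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical tf-idf-like vectors, for illustration only.
centroid = [1.0, 2.0, 0.0]
doc = [1.0, 2.0, 0.5]
long_doc = [10.0 * x for x in doc]  # same direction as doc, 10x the magnitude

print("raw distance:       ", euclidean(centroid, doc), euclidean(centroid, long_doc))
print("cosine similarity:  ", cosine(centroid, doc), cosine(centroid, long_doc))
print("normalized distance:",
      euclidean(normalize(centroid), normalize(doc)),
      euclidean(normalize(centroid), normalize(long_doc)))
```

Only the first line of output differs between `doc` and `long_doc`. After normalization, magnitude no longer influences the comparison at all.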

Results (unnormalized vectors, Euclidean distance)

61.79689257425985 222Proposed Research Details.doc 
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc 
65.51734241367222 done_RPFIII_dr.dogra.doc 
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc 
418.8868087170988 P. VIJAYA KUMAR (DSS).doc 
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc 
72.95414421468679 RPF-III__Indo-US_project.doc 
82.25126123574397 220Introduction and objectives.doc 

Results (normalized vectors, Euclidean distance)

1.3435369899385359 222Proposed Research Details.doc 
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc 
1.2902191708899362 done_RPFIII_dr.dogra.doc 
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc 
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc 
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc 
1.29150899172647 RPF-III__Indo-US_project.doc 
1.318016051789028 220Introduction and objectives.doc 

Results (cosine similarity)

0.09745417833344654 222Proposed Research Details.doc 
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc 
0.16766724553404047 done_RPFIII_dr.dogra.doc 
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc 
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc 
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc 
0.16600226214483405 RPF-III__Indo-US_project.doc 
0.13141684361322944 220Introduction and objectives.doc 

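Results 1 and 2 disagree, while results 2 and 3 agree strongly: the higher the similarity, the smaller the distance. Each distance is measured between the cluster centroid vector and the document vector of each document.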

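In fact, the strangest result is the document with a Euclidean distance of 418 that also has the highest similarity, 0.87. Once normalized, its distance becomes 0.49, which agrees with the similarity.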
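The agreement between results 2 and 3 can be checked numerically. The sketch below assumes the normalized vectors have unit L2 length, in which case the distance should equal `sqrt(2 - 2 * cosine)`. The numbers are copied from the result lists above; the names `pairs` and `expected_distance` are only illustrative.

```python
import math

# (normalized Euclidean distance, cosine similarity) pairs copied from the
# result lists above, in the same document order.
pairs = [
    (1.3435369899385359, 0.09745417833344654),
    (1.1277471087250086, 0.36409322938119104),
    (1.2741267093494966, 0.1883005642611103),
    (1.264154265747389, 0.2009569961963377),
    (1.2902191708899362, 0.16766724553404047),
    (1.3128744973475515, 0.13818027710720598),
    (0.4924243033927417, 0.8787591527140649),
    (1.1747048933792805, 0.3100342067353838),
    (1.29150899172647, 0.16600226214483405),
    (1.318016051789028, 0.13141684361322944),
]

def expected_distance(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine."""
    return math.sqrt(2.0 - 2.0 * cos_sim)

# Largest difference between the posted distance and the one implied by the cosine.
max_gap = max(abs(d - expected_distance(c)) for d, c in pairs)

# Document indices ranked by ascending distance and by descending similarity.
by_distance = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
by_similarity = sorted(range(len(pairs)), key=lambda i: -pairs[i][1])

print("largest gap:", max_gap)
print("same ranking:", by_distance == by_similarity)
```

If the gap is negligible and the two rankings coincide, the normalized Euclidean distance and the cosine similarity carry the same information for ranking purposes.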

On stats.stackexchange: http://stats.stackexchange.com/questions/35076/euclidean-distance-euclidean-distance-between-unit-vectors-or-cosine-similarity –

This question has been cross-posted on [Cross Validated](http://stats.stackexchange.com/questions/35076/euclidean-distance-bt-unit-vectors-or-cosine-similarity-where-vectors-are-docum), where it is a better fit. – BoltClock

Answers

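As far as I remember from my information retrieval lecture, once both vectors are normalized, Euclidean distance and cosine similarity produce exactly reversed orderings: the document with the smallest distance is the one with the highest similarity.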
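The identity behind this, assuming both vectors $a$ and $b$ are L2-normalized so that $\lVert a \rVert = \lVert b \rVert = 1$:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\cos\theta
$$

Since $\sqrt{2 - 2\cos\theta}$ decreases monotonically as $\cos\theta$ increases, sorting by ascending distance gives the same order as sorting by descending cosine similarity. For example, a cosine of 0.8788 gives $\sqrt{2 - 1.7575} \approx 0.4924$, which matches the normalized distance reported in the question for that document.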