Cosine similarity is widely used with n-gram counts or TF-IDF vectors.
from math import pi, acos

def similarity(x, y):
    # x and y are dicts (e.g. Counters) mapping each term to its weight
    dot = sum(x[k] * y[k] for k in x if k in y)
    return dot / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
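According to Wikipedia, cosine similarity can be used to compute a formal distance metric: the normalized angle 2 * acos(similarity) / pi obeys all the properties you would expect of a distance (symmetry, non-negativity, and so on). Note that the function below returns 1 minus that angle, so it scores 1 for vectors pointing the same way and 0 for vectors with nothing in common:
def distance_metric(x, y):
    # 1 minus the normalized angle between x and y
    return 1 - 2 * acos(similarity(x, y)) / pi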
Both of these measures range between 0 and 1 for non-negative vectors such as n-gram counts.
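For reference, the distance form itself can be sketched as follows. This is only a minimal sketch, assuming non-negative vectors stored as dicts or Counters; the names `cosine_similarity` and `angular_distance` are mine, not from any library. It clamps the cosine before calling `acos`, since rounding can push it slightly past 1, and spot-checks the distance properties on three toy vectors.

```python
from collections import Counter
from math import acos, pi

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x[k] * y[k] for k in x if k in y)
    norm_x = sum(v ** 2 for v in x.values()) ** 0.5
    norm_y = sum(v ** 2 for v in y.values()) ** 0.5
    return dot / (norm_x * norm_y)

def angular_distance(x, y):
    """Normalized angle: 0 means same direction, 1 means nothing shared.

    The [0, 1] range assumes non-negative weights (an assumption of this sketch).
    """
    # Clamp so rounding error cannot make acos raise ValueError.
    cos = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return 2 * acos(cos) / pi

# Toy vectors, made up for illustration.
a = Counter({'world': 1, 'again': 2})
b = Counter({'world': 1, 'once': 1})
c = Counter({'hi': 1})

d_ab, d_bc, d_ac = angular_distance(a, b), angular_distance(b, c), angular_distance(a, c)
print(d_ab, d_bc, d_ac)
print(d_ac <= d_ab + d_bc)  # triangle inequality on this one triple
```

A spot check on one triple is not a proof, of course; it only shows how you might test whichever measure you end up using.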
If you have a tokenizer that produces n-grams from a string, you can use these measures like this:
>>> import Tokenizer  # stand-in for whatever n-gram tokenizer you have
>>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))
>>> from collections import Counter
>>> list(tokenizer('Hello World again and again?'))
['world', 'again', 'again', 'world again', 'again again']
>>> Counter(tokenizer('Hello World again and again?'))
Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
>>> x = _
>>> Counter(tokenizer('Hi world once again.'))
Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
>>> y = _
>>> sum(x[k]*y[k] for k in x if k in y)/sum(v**2 for v in x.values())**.5/sum(v**2 for v in y.values())**.5
0.42857142857142855
>>> distance_metric(x, y)
0.28196592805724774
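If you don't have such a tokenizer, something along these lines reproduces the token lists in the session above. It is a rough sketch, not the `Tokenizer` class used there: `ngram_tokens` and its parameters are hypothetical names, and the behavior (lowercase, drop the excluded words, then emit unigrams followed by bigrams) is inferred from the output shown.

```python
import re
from collections import Counter

def ngram_tokens(text, ngrams=2, lower=True, nonwords=frozenset()):
    """Yield every 1- to `ngrams`-gram of the words in `text`.

    Words listed in `nonwords` are dropped before n-grams are formed,
    which is what the 'again again' bigram in the session suggests.
    """
    if lower:
        text = text.lower()
    # Runs of letters, digits, or underscores count as words; punctuation is dropped.
    words = [w for w in re.findall(r'\w+', text) if w not in nonwords]
    for n in range(1, ngrams + 1):
        for i in range(len(words) - n + 1):
            yield ' '.join(words[i:i + n])

x = Counter(ngram_tokens('Hello World again and again?', nonwords={'hello', 'and'}))
y = Counter(ngram_tokens('Hi world once again.'))
print(x)
print(y)
```

The resulting Counters can be passed straight to `similarity` and `distance_metric`, since both only need dict-like objects mapping terms to counts.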
I found the elegant inner product of Counters in this SO answer.
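I'd be curious to know whether your problem requires the distance to obey the [triangle inequality](http://en.wikipedia.org/wiki/Triangle_inequality), and if so, which of the solutions you find most satisfactory. – 2012-11-29 17:20:20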