Python: calculating the cosine similarity of two dicts faster
I have two dicts:
d1 = {1234: 4, 125: 7, ...}
d2 = {1234: 8, 1288: 5, ...}
The dicts vary in length from 10 to 40000. To calculate the cosine similarity I use this function:
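    from scipy.linalg import norm

    def simple_cosine_sim(a, b):
        # Iterate over the shorter dict so fewer lookups are needed
        if len(b) < len(a):
            a, b = b, a
        # Dot product over the keys of a; keys missing from b contribute 0
        res = 0
        for key, a_value in a.iteritems():
            res += a_value * b.get(key, 0)
        if res == 0:
            return 0
        try:
            # Divide the dot product by the Euclidean norms of both vectors
            res = res/norm(a.values())/norm(b.values())
        except ZeroDivisionError:
            res = 0
        return res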
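For reference, the function above calls `iteritems()`, which exists only in Python 2. A minimal sketch of the same computation in Python 3, using only the standard library, might look like this. The name `cosine_sim_py3` is hypothetical, and the sample dicts contain only the entries shown above (the elided `...` entries are left out), so treat this as an illustration rather than a tuned implementation.

```python
import math

def cosine_sim_py3(a, b):
    # Iterate over the shorter dict, as in the original function
    if len(b) < len(a):
        a, b = b, a
    # Dot product over shared keys; keys missing from b contribute 0
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    if dot == 0:
        return 0.0
    # Euclidean norms of the two sparse vectors
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Only the entries visible in the post; the real dicts are much longer
d1 = {1234: 4, 125: 7}
d2 = {1234: 8, 1288: 5}
print(cosine_sim_py3(d1, d2))
```

Here only key 1234 is shared, so the dot product is 4 * 8 = 32 and the result is 32 / sqrt(65 * 89).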
Is it possible to calculate the similarity faster?
UPD: I rewrote the code with Cython, which made it about 15% faster. Thanks @Davidmh
    from scipy.linalg import norm

    def fast_cosine_sim(a, b):
        # Iterate over the shorter dict so fewer lookups are needed
        if len(b) < len(a):
            a, b = b, a
        # C-typed accumulator and loop variables
        cdef long up, key
        cdef int a_value, b_value
        up = 0
        for key, a_value in a.iteritems():
            b_value = b.get(key, 0)
            up += a_value * b_value
        if up == 0:
            return 0
        return up/norm(a.values())/norm(b.values())
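One thing the functions above do on every call is recompute both norms. If each dict is compared against many others (an assumption; the post does not say how the function is called), the norms could be computed once per dict and passed in. The sketch below is plain Python 3 with hypothetical names (`vector_norm`, `cosine_sim_with_norms`); it is not the Cython version above, and any speedup depends on the workload.

```python
import math

def vector_norm(d):
    # Euclidean norm of a sparse vector stored as {key: value}
    return math.sqrt(sum(v * v for v in d.values()))

def cosine_sim_with_norms(a, norm_a, b, norm_b):
    # Swapping a and b is safe: the product of the norms is symmetric
    if len(b) < len(a):
        a, b = b, a
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    if dot == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical usage: compute every norm once, then compare all pairs
vectors = [{1234: 4, 125: 7}, {1234: 8, 1288: 5}, {125: 1}]
norms = [vector_norm(v) for v in vectors]
pairwise = {
    (i, j): cosine_sim_with_norms(vectors[i], norms[i], vectors[j], norms[j])
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
}
```

With n dicts this computes n norms instead of one pair of norms for each of the n * (n - 1) / 2 comparisons.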
I have commented on your Cython code and added an alternative approach. I hope this helps. – Davidmh