Solution
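I used https://stackoverflow.com/a/15174569/61903 (credits to @vpekar) to compute the cosine similarity of two strings as the base similarity algorithm. Basically, I put all strings into a list. Then I set an index i to 0 and loop over i as long as it is within the length of the list. Inside that loop I iterate a position p from i + 1 to length(list) and find the maximum cosine between list[i] and list[p]. Both text strings are then put into an "out" list so that they are not considered in later similarity calculations. Both text strings are also put into the result list together with their cosine value, using the data structure VectorResult.
Afterwards the result list is sorted by cosine value. We now have unique pairs of strings in order of descending cosine, aka similarity value. HTH.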
import re
import math
import timeit
from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    # Dot product over the shared words, divided by the product of the norms.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator


def text_to_vector(text):
    # Bag-of-words vector: word -> number of occurrences.
    return Counter(WORD.findall(text))


class VectorResult(object):
    # A pair of texts and their cosine similarity; compared by cosine only.
    def __init__(self, cosine, text_1, text_2):
        self.cosine = cosine
        self.text_1 = text_1
        self.text_2 = text_2

    def __eq__(self, other):
        return self.cosine == other.cosine

    def __le__(self, other):
        return self.cosine <= other.cosine

    def __ge__(self, other):
        return self.cosine >= other.cosine

    def __lt__(self, other):
        return self.cosine < other.cosine

    def __gt__(self, other):
        return self.cosine > other.cosine


def main():
    start = timeit.default_timer()
    with open('data.txt', 'r') as f:
        texts = f.readlines()  # each line keeps its trailing newline
    cosmap = []  # best pair found for each i
    out = []     # texts that are already part of a pair
    i = 0
    while i < len(texts):
        max_cosine = 0.0
        current = None
        for p in range(i + 1, len(texts)):
            # Skip strings that were already paired up.
            if texts[i] in out or texts[p] in out:
                continue
            vector1 = text_to_vector(texts[i])
            vector2 = text_to_vector(texts[p])
            cosine = get_cosine(vector1, vector2)
            if cosine > max_cosine:
                current = VectorResult(cosine, texts[i], texts[p])
                max_cosine = cosine
        if current:
            out.extend([current.text_1, current.text_2])
            cosmap.append(current)
        i += 1
    # Sort ascending, then print in reverse for descending similarity.
    cosmap = sorted(cosmap)
    for item in reversed(cosmap):
        print(item.cosine, item.text_1, item.text_2)
    end = timeit.default_timer()
    print("Similarity Sorting of {} strings lasted {} s.".format(len(texts), end - start))


if __name__ == '__main__':
    main()
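For trying the pairing step without a data.txt file, here is a minimal sketch of the same greedy idea as a function over an in-memory list. It is a variation, not the script above: the names `cosine` and `greedy_pairs` and the sample strings are made up for illustration, each string is tokenized only once, and used strings are tracked by index rather than by value, so exact duplicate lines are handled slightly differently.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def cosine(vec1, vec2):
    """Cosine similarity of two word-count Counters."""
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    denominator = (math.sqrt(sum(c * c for c in vec1.values()))
                   * math.sqrt(sum(c * c for c in vec2.values())))
    return numerator / denominator if denominator else 0.0


def greedy_pairs(texts):
    """Pair each unused string with its most similar unused successor."""
    vectors = [Counter(WORD.findall(t)) for t in texts]  # tokenize once
    used = set()  # indices that already belong to a pair
    pairs = []
    for i in range(len(texts)):
        if i in used:
            continue
        best_p, best_cos = None, 0.0
        for p in range(i + 1, len(texts)):
            if p in used:
                continue
            c = cosine(vectors[i], vectors[p])
            if c > best_cos:
                best_p, best_cos = p, c
        if best_p is not None:
            used.update((i, best_p))
            pairs.append((best_cos, texts[i], texts[best_p]))
    # Highest similarity first.
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample, loosely modelled on the address data.
sample = [
    "B B ROAD YELAHANKA",
    "C M C ROAD YALAHANKA",
    "1002 B B ROAD YELAHANKA",
    "LAKSHMI COMPLEX C M C ROAD YALAHANKA",
]
result = greedy_pairs(sample)
for cos, first, second in result:
    print(round(cos, 3), first, "|", second)
```

Like the script, this is quadratic in the number of strings; the only saving is that the word counts are not rebuilt inside the inner loop.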
Results
I used your sample at http://pastebin.com/hySkZ4Pn as test data:
1.0000000000000002 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA
NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA
1.0 # 51/3 AGRAHARA YELAHANKA
#51/3 AGRAHARA YELAHANKA
0.9999999999999999 # C M C ROAD,YALAHANKA
# C M C ROAD,YALAHANKA
0.8728715609439696 # 1002/B B B ROAD,YELAHANKA
0,B B ROAD,YELAHANKA
0.8432740427115678 # LAKSHMI COMPLEX C M C ROAD,YALAHANKA
# SRI LAKSHMAN COMPLEX C M C ROAD,YALAHANKA
0.8333333333333335 # 85/1 B B M P OFFICE ROAD,KOGILU YELAHANKA
#85/1 B B M P OFFICE NEAR KOGILU YALAHANKA
0.8249579113843053 # 689 3RD A CROSS SHESHADRIPURAM CALLEGE OPP YELAHANKA
# 715 3RD CROSS A SECTUR SHESHADRIPURAM CALLEGE OPP YELAHANKA
0.8249579113843053 # 10 RAMAIAIA COMPLEX B B ROAD,YALAHANKA
# JAMATI COMPLEX B B ROAD,YALAHANKA
[ SNIPPED ]
Similarity Sorting of 702 strings lasted 8.955146235887025 s.
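I don't know about elasticsearch, but in python you could at least build a Counter of the words and then use it to construct a key function for sorted, say with the sum of the counts of each word as the term to sort by – Copperfield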
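One possible reading of the comment above, as a rough sketch: count every word across all strings, then sort the strings by the summed counts of the words they contain. The function name `sort_by_word_frequency` and the toy input are assumptions for illustration. Note that this only ranks strings by how common their words are; it does not compute a pairwise similarity.

```python
import re
from collections import Counter

WORD = re.compile(r'\w+')


def sort_by_word_frequency(texts):
    # Count each word over the whole list of strings.
    counts = Counter(w for t in texts for w in WORD.findall(t))

    def key(text):
        # Sort term: total corpus count of the words in this string.
        return sum(counts[w] for w in WORD.findall(text))

    return sorted(texts, key=key, reverse=True)


# Toy input, not taken from the question's data.
ordered = sort_by_word_frequency(["A B", "B C D", "E"])
print(ordered)
```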
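I have answered and shown how to use cosine similarity with a simple (naive) python approach. However, if you depend on issuing Elasticsearch queries, you should have a look at the MLT query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html). – ferdy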
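As a rough idea of what such a request body might look like, the sketch below only builds the JSON for a `more_like_this` query and prints it; it does not contact a cluster. The field name `address` and the helper `build_mlt_query` are hypothetical, and the two frequency thresholds are lowered to 1 on the assumption that the address strings are short; check the linked documentation for the parameters that fit your index.

```python
import json


def build_mlt_query(text, field="address"):
    # "address" is a placeholder; use the text field of your own mapping.
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                # Lowered from the defaults so that short strings still match.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


body = build_mlt_query("# C M C ROAD,YALAHANKA")
print(json.dumps(body, indent=2))
```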