Computing a similarity "score" between multiple dictionaries

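I have a reference dictionary, "dictA", and I need to compare it (that is, compute the similarity of its keys and values) with n dictionaries that are generated on the spot. All of the dictionaries have the same length. For the sake of discussion, say there are 3 dictionaries to compare against: dictB, dictC and dictD.

Here is what dictA looks like: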

dictA={'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}

Here is what dictB, dictC and dictD look like:

dictB={'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"} 
dictC={'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"} 
dictD={'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}

I have a solution, but it only handles two dictionaries:

sharedValue = set(dictA.items()) & set(dictD.items()) 
dictLength = len(dictA) 
scoreOfSimilarity = len(sharedValue) 
similarity = scoreOfSimilarity/dictLength

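My question is: how can I iterate over the n dictionaries, with dictA as the primary dictionary that the others are compared against? The goal is to get a "similarity" value for each dictionary I iterate over, measured against the primary dictionary.

Thanks for your help.

Source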

2016-10-11 lechiffre

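1) Are these n dictionaries in a list? 2) How do you want to compute the similarity score across multiple iterations (e.g. as an average)? – SuperSaiyan

Why not just loop over a list of the dictionaries B through D? Are there particular performance or data-structure constraints you need to meet while solving this? –

FYI, in Python 3 'dict.items()' already works with '&' and the other set operators. It is not a list but a set-like view object of the dictionary's items. –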
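A minimal sketch of the point made in the last comment, assuming Python 3 and hashable values (the one-letter strings here are). The two dictionaries are copied from the question; `shared` is just an illustrative name.

```python
# Dictionaries copied from the question.
dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}

# items() returns a set-like view, so & works without wrapping it in set().
# This relies on every value being hashable, which holds for strings.
shared = dictA.items() & dictD.items()

print(shared)                    # {('4', 'U')}
print(len(shared) / len(dictA))  # one shared pair out of six
```

The result is an ordinary set, so the rest of the two-dictionary solution works unchanged.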

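Here is a general structure, assuming you can generate the dictionaries one at a time and use each one before generating the next. That sounds like what you might want. calculate_similarity would be a function containing the "I have a solution" code above.

reference = {'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    calculate_similarity(reference, on_the_spot)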

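If you need to iterate over dictionaries that have already been generated, you have to put them into an iterable Python structure. As you generate them, build a list of dictionaries:

victim_list = [
    {'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"},
    {'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"},
    {'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}
]
for on_the_spot in victim_list:
    # Proceed as above
    calculate_similarity(reference, on_the_spot)

Are you familiar with Python's generator construct? It is like a function that hands back its values with yield instead of return. If so, use that in place of the list above.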
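A minimal sketch of the generator variant, under stated assumptions: `generate_dictionaries` is a hypothetical stand-in for whatever produces the dictionaries on the spot, and `calculate_similarity` wraps the two-dictionary solution from the question.

```python
def calculate_similarity(reference, other):
    """Fraction of (key, value) pairs that the two dictionaries share."""
    return len(reference.items() & other.items()) / len(reference)


def generate_dictionaries():
    """Hypothetical producer: yields one dictionary at a time."""
    yield {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
    yield {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
    yield {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}


reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# Each dictionary is produced only when the loop asks for the next one,
# so the full set never has to be held in memory at once.
scores = [calculate_similarity(reference, d) for d in generate_dictionaries()]
print(scores)
```

Compared with the list, nothing changes at the call site except the iterable being looped over.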

Source

2016-10-11 22:29:04 Prune

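If you put your solution in a function, you can call it by name for any two dicts. Also, if you curry the function by splitting the arguments across nested functions, you can partially apply the first dict to get back a function that needs only the second (or you could use functools.partial), which makes it easy to map:

def similarity(a):
    def _(b):
        sharedValue = set(a.items()) & set(b.items())
        dictLength = len(a)
        scoreOfSimilarity = len(sharedValue)
        return scoreOfSimilarity/dictLength
    return _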

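Also: the above can be written as a single expression using nested lambdas:

similarity = lambda a: lambda b: len(set(a.items()) & set(b.items()))/len(a)

Now you can get the similarity between dictA and each of the others with map:

otherDicts = [dictB, dictC, dictD]
scores = list(map(similarity(dictA), otherDicts))

Now you can use min() (or max(), or whatever) to pick the best from the list of scores:

winner = min(scores)

Warning: I haven't tested any of the above.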
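The functools.partial alternative mentioned in this answer could be sketched as follows. This is an illustration rather than the answerer's code: the two-argument `similarity` and the names `similarity_to_a` and `best` are assumptions, and max() is used on the reading that a higher score means a closer match.

```python
from functools import partial


def similarity(a, b):
    """Fraction of (key, value) pairs shared by dictionaries a and b."""
    shared = set(a.items()) & set(b.items())
    return len(shared) / len(a)


dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dictB = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dictC = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
otherDicts = [dictB, dictC, dictD]

# Fix the first argument once; the result takes only the second dictionary.
similarity_to_a = partial(similarity, dictA)

scores = list(map(similarity_to_a, otherDicts))
best = max(scores)  # highest score = most similar to dictA
print(scores, best)
```

partial keeps `similarity` usable as a plain two-argument function elsewhere, which the nested-function version does not.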

Source

2016-10-11 22:36:26

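Please don't use "_" as the name of a function, even an inner one. http://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python – lejlot

Thanks to everyone who contributed answers. Here is the result, which does what I need: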

def compareTwoDictionaries(self, absolute, reference, listOfDictionaries):
    # look only for absolute fit, yes or no
    if absolute:
        similarity = reference == listOfDictionaries
    else:
        # items that are the same between the two dictionaries
        shared_items = set(reference.items()) & set(listOfDictionaries.items())
        # length of the reference dictionary, for the % calculation
        dictLength = len(reference)
        # number of shared items, for the % calculation
        scoreOfSimilarity = len(shared_items)
        # final score: similarity
        similarity = scoreOfSimilarity/dictLength
    return similarity

Here is how the "victim_list" of dictionaries is used with the function above:

for candidate in victim_list:
    output = oandaConnectorCalls.compareTwoDictionaries(False, reference, candidate)

That is the "reference" dict and the call.
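Since the loop above overwrites output on every pass, one way to keep a score per dictionary is to collect the results in a list. The sketch below is a hedged illustration: `compare_two_dictionaries` is a standalone rewrite of the method (an assumption, since the surrounding class is not shown), and `scores` and `exact` are illustrative names.

```python
def compare_two_dictionaries(absolute, reference, candidate):
    """Standalone version (an assumption) of the method shown above."""
    if absolute:
        # Exact match only: True or False.
        return reference == candidate
    shared_items = set(reference.items()) & set(candidate.items())
    return len(shared_items) / len(reference)


reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
victim_list = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
]

# One result per dictionary, in the same order as victim_list.
scores = [compare_two_dictionaries(False, reference, c) for c in victim_list]
exact = [compare_two_dictionaries(True, reference, c) for c in victim_list]
print(scores)
print(exact)
```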

Source

2016-10-12 13:55:55 lechiffre

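Given your problem setup, there appears to be no alternative to looping over the list of input dictionaries. However, a multiprocessing trick can be applied here.

Here is your input: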

dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"} 
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"} 
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"} 
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"} 
other_dicts = [dict_b, dict_c, dict_d]

I have included @gary_fixler's map technique as similarity1, and I will use the similarity2 function for the loop technique.

def similarity1(a):
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity/dict_length
    return _

def similarity2(c): 
    a, b = c 
    shared_value = set(a.items()) & set(b.items()) 
    dict_length = len(a) 
    score_of_similarity = len(shared_value) 
    return score_of_similarity/dict_length

We are evaluating 3 techniques here:
(1) @gary_fixler's map
(2) a simple loop over the list of dicts
(3) multiprocessing over the list of dicts

Here are the statements that execute each technique:

import itertools
import multiprocessing

print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])

# Pool needs at least one process, so guard against machines with few CPUs.
max_processes = max(1, int(multiprocessing.cpu_count()/2-1))
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])

You will find that all 3 techniques produce the same result:

[0.5, 0.3333333333333333, 0.16666666666666666] 
[0.5, 0.3333333333333333, 0.16666666666666666] 
[0.5, 0.3333333333333333, 0.16666666666666666]

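Note that for multiprocessing you have multiprocessing.cpu_count()/2 cores (with each core having hyper-threading). Assuming nothing else is running on your system and your program has no I/O or synchronization needs (as is the case for our problem), you will often get the best performance with multiprocessing.cpu_count()/2-1 processes, the -1 being for the parent process.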
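The rule of thumb above can be sketched as a small helper. Treat this as an illustration under assumptions, not a tuned recipe: `worker_count`, `similarity` and `parallel_scores` are hypothetical names, the halving assumes two logical CPUs per physical core, and the floor of 1 is added because multiprocessing.Pool rejects a process count of zero.

```python
import functools
import multiprocessing


def worker_count(logical_cpus):
    """Physical cores (assuming 2 logical CPUs each) minus one, never below 1."""
    return max(1, logical_cpus // 2 - 1)


def similarity(a, b):
    """Fraction of (key, value) pairs shared by dictionaries a and b."""
    return len(set(a.items()) & set(b.items())) / len(a)


def parallel_scores(reference, others):
    """Score every dictionary in `others` against `reference` using a pool."""
    processes = worker_count(multiprocessing.cpu_count())
    with multiprocessing.Pool(processes=processes) as pool:
        # partial fixes the reference so each task ships only one dictionary.
        return pool.map(functools.partial(similarity, reference), others)
```

On platforms that start workers by re-importing the main module, parallel_scores(dict_a, other_dicts) should be called from under an `if __name__ == "__main__":` guard in a script file.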

Now, timing the 3 techniques:

import timeit

print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
        setup="from __main__ import similarity1, dict_a, other_dicts",
        number=10000))

print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
        setup="from __main__ import similarity2, dict_a, other_dicts",
        number=10000))

# The timed statement uses itertools, so import it in the setup as well.
print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
        setup="import itertools; from __main__ import similarity2, dict_a, other_dicts, pool",
        number=10000))

This produces the following results on my laptop:

0.07092539698351175 
0.06757041101809591 
1.6528456939850003

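You can see that the basic loop technique performs best. Multiprocessing is significantly worse than the other two techniques because of the overhead of creating processes and passing data back and forth. That does not mean multiprocessing is useless here. Quite the contrary. Look at the results for a larger number of input dictionaries: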

for _ in range(7): 
    other_dicts.extend(other_dicts)

This extends the list of dictionaries to 384 items. Here are the timing results for this input:
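As a quick check of that figure: each pass of the loop doubles the list, so 3 items become 3 * 2**7 = 384 after 7 passes. The sketch below uses placeholder strings in place of the real dictionaries, since only the length matters here.

```python
# Placeholders standing in for dict_b, dict_c and dict_d.
other_dicts = ["b", "c", "d"]

for _ in range(7):
    # Extending a list with itself doubles its length.
    other_dicts.extend(other_dicts)

print(len(other_dicts))  # 384
```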

7.934810006991029 
8.184540337068029 
7.466550623998046

For any larger set of input dictionaries, the multiprocessing technique becomes the best performer.

Source

2016-10-12 16:53:15
