0

我有一個Python程序來讀取兩個列表(一個錯誤和其他與正確的數據)。我錯誤列表中的每個元素都需要與我正確列表中的每個元素進行比較。比較後,我得到每一個比較對之間的所有編輯距離。現在我可以找到給定錯誤數據的最小編輯距離,並獲取我的正確數據。Python中的Levenshtein距離只給出1作爲編輯距離

我正在嘗試使用levenshtein距離來計算編輯距離,但它將所有編輯距離都返回爲1,哪怕是錯誤的。

這意味着用於計算levenshtein距離的代碼是不正確的。我正在努力爲此找到解決辦法。幫幫我!

我的代碼

import csv 

def lev(a, b): 
    if not a: return len(b) 
    if not b: return len(a) 
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1) 

if __name__ == "__main__": 

    with open("all_correct_promo.csv","rb") as file1: 
     reader1 = csv.reader(file1) 
     correctPromoList = list(reader1) 
     #print correctPromoList 

    with open("all_extracted_promo.csv","rb") as file2: 
     reader2 = csv.reader(file2) 
     extractedPromoList = list(reader2) 
     #print extractedPromoList 

    incorrectPromo = [] 
    count = 0 
    for extracted in extractedPromoList: 
     if(extracted not in correctPromoList): 
      incorrectPromo.append(extracted) 
     else: 
      count = count + 1 
    #print incorrectPromo 

    for promos in incorrectPromo: 
     for correctPromo in correctPromoList: 
      distance = lev(promos,correctPromo) 
      print promos, correctPromo , distance 
+0

正如我張貼在我的答案,你implmentation似乎是正確的操作(雖然我建議你一個更好的)。如果您需要修正這一問題,請提供您的算法錯誤地返回1的情況(我無法自己重現) – caspillaga

回答

1

的實施是正確的。我測試了這一點:

def lev(a, b): 
    if not a: return len(b) 
    if not b: return len(a) 
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1) 


print lev('abcde','bc') # prints 3, which is correct 
print lev('abc','bc') # prints 1, which is correct 

你的問題,我注意到了你的意見,可能是當你調用方法:

a = ['NSP-212690'] 
b = ['FE SV X'] 

print lev(a,b) # prints 1 which is incorrect because you are comparing arrays, not strings 
print lev(a[0],b[0]) # prints 10, which is correct 

所以,你可以做的是:

之前打電話給「列弗(A,b)」,提取每個數組的第一個元素

def lev(a, b): 
    if not a: return len(b) 
    if not b: return len(a) 
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1) 

a = ['NSP-212690'] 
b = ['FE SV X'] 
a = a[0] # this is the key part 
b = b[0] # and this 
print lev(a,b) # prints 10, which is correct 

無論如何,我不會建議你重新草書實現,因爲性能是veeeery差

我會推薦這實現,而不是(來源:wikipedia-levenshtein

def lev(seq1, seq2): 
    oneago = None 
    thisrow = range(1, len(seq2) + 1) + [0] 
    for x in xrange(len(seq1)): 
     twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] 
     for y in xrange(len(seq2)): 
      delcost = oneago[y] + 1 
      addcost = thisrow[y - 1] + 1 
      subcost = oneago[y - 1] + (seq1[x] != seq2[y]) 
      thisrow[y] = min(delcost, addcost, subcost) 
    return thisrow[len(seq2) - 1] 

或者這也許稍作修改的版本:

def lev(seq1, seq2): 
    if not a: return len(b) 
    if not b: return len(a) 
    oneago = None 
    thisrow = range(1, len(seq2) + 1) + [0] 
    for x in xrange(len(seq1)): 
     twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] 
     for y in xrange(len(seq2)): 
      delcost = oneago[y] + 1 
      addcost = thisrow[y - 1] + 1 
      subcost = oneago[y - 1] + (seq1[x] != seq2[y]) 
      thisrow[y] = min(delcost, addcost, subcost) 
    return thisrow[len(seq2) - 1] 
+0

我嘗試了這兩個代碼,並且仍然爲所有比較的促銷代碼提供編輯距離1。 [ 'NSP-212690'] [ 'FE SV X'] 1 [ 'NSP-212690'] [ 'FE SV2 A'] 1 [ 'NSP-212690'] [ 'FE SV2 B'] 1 ['NSP-212690'] ['FE SV2 C'] 1 ['NSP-212690'] ['FE SV2 D'] 1 ['NSP-212690'] ['FE SV2 E'] 1 [ NSP-212690'] ['FLATRATE DS1'] 1 ['NSP-212690'] ['FLATRATE DS1c'] 1 ['NSP-212690'] ['FLATRATE DS3'] 1 ['NSP-212690'] ['FreeMM'] 1 ['NSP-212690'] ['FreeWeb'] 1 [''NSP-212690'] ['FTTCIP10'] 1 – safwan

+0

啊!現在我看到你的問題了!我會在幾秒內更新我的答案......請閱讀! – caspillaga

+0

我剛更新了我的答案。我認爲你的問題是在方法調用上,而不是在方法本身上。您正在比較數組而不是字符串,所以您應該在方法調用之前提取每個數組的第一個(也是唯一)元素。 a = ['NSP-212690'] ---> a [0] ='NSP-212690' – caspillaga

0

有一個Python包可用,實現levenshtein距離:python-levenshtein

要安裝它:

pip install python-levenshtein 

要使用它:

>>> import Levenshtein 
>>> string1 = 'dsfjksdjs' 
>>> string2 = 'dsfiksjsd' 
>>> print Levenshtein.distance(string1, string2) 
3