2013-01-09 159 views
2

我有一個程序會告訴我兩個字符串之間的距離,它工作得很好。顯示字符串對齊

例如

word 1 = hello 
word 2 = hi 

具有5成本從一個去其他(的取代成ë是2,和有3個插入)。

基本上,一個插入成本1,刪除成本1,替換成爲2.字符串也可以在字符串中減少成本。

我需要一種方法來記住什麼操作正在發生在什麼點,以便我可以顯示對齊方式。

例如

wax 
S M S(substitute move substitute, cost of 4) 
and 

任何想法或建議?

import sys 
from sys import stdout 


def minEditDist(target, source): 

    # Length of the target strings set to variables 
    n = len(target) 
    m = len(source) 

    distance = [[0 for i in range(m+1)] for j in range(n+1)] 

    for i in range(1,n+1): 
     distance[i][0] = distance[i-1][0] + insertCost(target[i-1]) 

    for j in range(1,m+1): 
     distance[0][j] = distance[0][j-1] + deleteCost(source[j-1]) 


    for i in range(1,n+1): 
     for j in range(1,m+1): 
      distance[i][j] = min(distance[i-1][j]+1, 
           distance[i][j-1]+1, 
           distance[i-1][j-1]+subCost(source[j-1],target[i-1])) 

    # Return the minimum distance using all the table cells 
    return distance[i][j] 

def subCost(x,y): 
    if x == y: 
     return 0 
    else: 
     return 2 

def insertCost(x): 
    return 1 

def deleteCost(x): 
    return 1 

# User inputs the strings for comparison 
# Commented out here because cloud9 won't take input like this 
# word1 = raw_input("Enter A Word: ") 
# word2 = raw_input("Enter The Second Word: ") 
word1 = "wax" 
word2 = "and" 
word1x = word1 
word2x = word2 
# Reassign variables to words with stripped right side whitespace 
word1x = word1x.strip() 
word2x = word2x.strip() 

if(len(word1) > len(word2)): 
    range_num = len(word1) 
else: 
    range_num = len(word2) 

# Display the minimum distance between the two specified strings 
print "The minimum edit distance between S1 and S2 is: ", minEditDist(word1x,word2x), "!" 
print (word1x) 
print (word2x) 
+0

什麼「移動」是什麼意思? – ATOzTOA

回答

1

你可以從這樣的事情開始。

我已經添加了「S」的正確數據。

path = [] 

def minEditDist(target, source): 

    # Length of the target strings set to variables 
    n = len(target) 
    m = len(source) 

    distance = [[0 for i in range(m+1)] for j in range(n+1)] 

    for i in range(1,n+1): 
     distance[i][0] = distance[i-1][0] + insertCost(target[i-1]) 

    for j in range(1,m+1): 
     distance[0][j] = distance[0][j-1] + deleteCost(source[j-1]) 


    for i in range(1,n+1): 
     for j in range(1,m+1): 
      sc = subCost(source[j-1],target[i-1]) 
      distance[i][j] = min(distance[i-1][j]+1, 
           distance[i][j-1]+1, 
           distance[i-1][j-1]+sc) 
      if distance[i-1][j]+1 > distance[i-1][j-1]+sc and distance[i][j-1]+1 > distance[i-1][j-1]+sc: 
       path.append("S"); 

    print path 

    # Return the minimum distance using all the table cells 
    return distance[i][j] 

def subCost(x,y): 
    if x == y: 
     return 0 
    else: 
     return 2 

def insertCost(x): 
    path.append("I") 
    return 1 

def deleteCost(x): 
    path.append("D") 
    return 1 
+0

這不適合你在這裏使用的方式。你在'insertCost'和'deleteCost'的'path'中添加元素,但是這些方法被調用來預填充距離矩陣,因此你在這裏構造的'path'將始終以'I','I'開始, 'I','D','D','D'(三個字)。 – sloth

+0

我已經特別提到在答案中,我只爲'substitute'做過。 – ATOzTOA

1

您計算Levenshtein Distance(或更好,一個加權Levenshtein距離,因爲您的運營成本是不同的:I/D => 1,M => 2)。

爲了獲得操作的順序,一種常用的方法是做某種回溯。

考慮以下方法backtrace *:

... 
# Return the minimum distance using all the table cells 
def backtrace(i, j): 
    if i>0 and j>0 and distance[i-1][j-1] + 2 == distance[i][j]: 
     return backtrace(i-1, j-1) + "S" 
    if i>0 and j>0 and distance[i-1][j-1] == distance[i][j]: 
     return backtrace(i-1, j-1) + "M" 
    if i>0 and distance[i-1][j] + 1 == distance[i][j]: 
     return backtrace(i-1, j) + "D" 
    if j>0 and distance[i][j-1] + 1 == distance[i][j]: 
     return backtrace(i, j-1) + "I" 
    return "" 

return distance[i][j], backtrace(i, j) 

我增加一條,作爲嵌套方法你的方法,所以我不必須通過你的距離矩陣distance作爲參數傳遞給它。

現在你的腳本輸出

S1和S2之間的最小編輯距離爲:(4, '短信')!


還要注意的是,如果你想使用Levenshtein距離在Python,有一個快速的實施名爲pylevenshtein on google code


*可能不是100%準確:-)