2014-05-07 45 views
2

我正在使用字符串編輯距離(Levenshtein-distance)比較來自眼動跟蹤實驗的掃描路徑。 (現在我正在使用R中的stringdist包)帶有權重的Levenshtein距離/鄰接懲罰

基本上,字符串的字母表示6x4矩陣中的(凝視)位置。該矩陣被配置如下:

 [,1] [,2] [,3] [,4] 
[1,] 'a' 'g' 'm' 's' 
[2,] 'b' 'h' 'n' 't' 
[3,] 'c' 'i' 'o' 'u' 
[4,] 'd' 'j' 'p' 'v' 
[5,] 'e' 'k' 'q' 'w' 
[6,] 'f' 'l' 'r' 'x' 

如果我使用基本Levenshtein距離比較字符串,的ag在一個字符串的比較給出了同樣的估計作爲ax comparicon。

如:

'abc' compared to 'agc' -> 1 
'abc' compared to 'axc' -> 1 

這意味着字符串同樣(DIS)類似

我希望能夠把權重上的方式字符串比較的是,在基體結合鄰接。例如。 ax之間的距離應加權爲大於ag之間的距離。

一種方法是計算矩陣中一個字母到另一個字母的「步行」(水平和垂直步長),併除以最大「步行」距離(即從ax)。例如。從「a」到「g」的距離爲「1」,從ax的距離爲8,結果分別爲1/8和1。

有沒有一種方法來實現這一點(在R或python)?

回答

4

您需要Wagner-Fisher algorithm版本的內部循環使用非單位成本。即其中通常的算法具有+1,使用+del_cost(a[i])等,並將del_cost,ins_costsub_cost定義爲採用一個或兩個符號(可能只是表查找)的函數。

2

如果有人有同樣的「問題」,這是我的解決方案。我爲Kyle Gorman編寫的Wagner-Fischer算法的python實現添加了一個附件。

這個插件是在_dist函數中它的權重函數和實現。

#!/usr/bin/env python 
# wagnerfischer.py: Dynamic programming Levensthein distance function 
# Kyle Gorman <[email protected]> 
# 
# Based on: 
# 
# Robert A. Wagner and Michael J. Fischer (1974). The string-to-string 
# correction problem. Journal of the ACM 21(1):168-173. 
# 
# The thresholding function was inspired by BSD-licensed code from 
# Babushka, a Ruby tool by Ben Hoskings and others. 
# 
# Unlike many other Levenshtein distance functions out there, this works 
# on arbitrary comparable Python objects, not just strings. 


try: # use numpy arrays if possible... 
    from numpy import zeros 
    def _zeros(*shape): 
     """ like this syntax better...a la MATLAB """ 
     return zeros(shape) 

except ImportError: # otherwise do this cute solution 
    def _zeros(*shape): 
     if len(shape) == 0: 
      return 0 
     car = shape[0] 
     cdr = shape[1:] 
     return [_zeros(*cdr) for i in range(car)] 

def weight(A,B, weights): 
    if weights == True: 
     from numpy import matrix 
     from numpy import where 
     # cost_weight defines the matrix structure of the AOI-placement 
     cost_weight = matrix([["a","b","c","d","e","f"],["g","h","i","j","k","l"], 
     ["m","n","o","p","q","r"],["s","t","u","v","w","x"]]) 

     max_walk = 8.00 # defined as the maximum posible distance between letters in 
         # the cost_weight matrix 

     indexA = where(cost_weight==A) 
     indexB = where(cost_weight==B) 

     walk = abs(indexA[0][0]-indexB[0][0])+abs(indexA[1][0]-indexB[1][0]) 

     w = walk/max_walk 

     return w 
    else: 
     return 1 

def _dist(A, B, insertion, deletion, substitution, weights=True): 
    D = _zeros(len(A) + 1, len(B) + 1) 
    for i in xrange(len(A)): 
     D[i + 1][0] = D[i][0] + deletion * weight(A[i],B[0], weights) 
    for j in xrange(len(B)): 
     D[0][j + 1] = D[0][j] + insertion * weight(A[0],B[j], weights) 
    for i in xrange(len(A)): # fill out middle of matrix 
     for j in xrange(len(B)): 
      if A[i] == B[j]: 
       D[i + 1][j + 1] = D[i][j] # aka, it's free. 
      else: 
       D[i + 1][j + 1] = min(D[i + 1][j] + insertion * weight(A[i],B[j], weights), 
             D[i][j + 1] + deletion * weight(A[i],B[j], weights), 
             D[i][j]  + substitution * weight(A[i],B[j], weights)) 
    return D 

def _dist_thresh(A, B, thresh, insertion, deletion, substitution): 
    D = _zeros(len(A) + 1, len(B) + 1) 
    for i in xrange(len(A)): 
     D[i + 1][0] = D[i][0] + deletion 
    for j in xrange(len(B)): 
     D[0][j + 1] = D[0][j] + insertion 
    for i in xrange(len(A)): # fill out middle of matrix 
     for j in xrange(len(B)): 
      if A[i] == B[j]: 
       D[i + 1][j + 1] = D[i][j] # aka, it's free. 
      else: 
       D[i + 1][j + 1] = min(D[i + 1][j] + insertion, 
             D[i][j + 1] + deletion, 
             D[i][j]  + substitution) 
     if min(D[i + 1]) >= thresh: 
      return 
    return D 

def _levenshtein(A, B, insertion, deletion, substitution): 
    return _dist(A, B, insertion, deletion, substitution)[len(A)][len(B)] 

def _levenshtein_ids(A, B, insertion, deletion, substitution): 
    """ 
    Perform a backtrace to determine the optimal path. This was hard. 
    """ 
    D = _dist(A, B, insertion, deletion, substitution) 
    i = len(A) 
    j = len(B) 
    ins_c = 0 
    del_c = 0 
    sub_c = 0 
    while True: 
     if i > 0: 
      if j > 0: 
       if D[i - 1][j] <= D[i][j - 1]: # if ins < del 
        if D[i - 1][j] < D[i - 1][j - 1]: # if ins < m/s 
         ins_c += 1 
        else: 
         if D[i][j] != D[i - 1][j - 1]: # if not m 
          sub_c += 1 
         j -= 1 
        i -= 1 
       else: 
        if D[i][j - 1] <= D[i - 1][j - 1]: # if del < m/s 
         del_c += 1 
        else: 
         if D[i][j] != D[i - 1][j - 1]: # if not m 
          sub_c += 1 
         i -= 1 
        j -= 1 
      else: # only insert 
       ins_c += 1 
       i -= 1 
     elif j > 0: # only delete 
      del_c += 1 
      j -= 1 
     else: 
      return (ins_c, del_c, sub_c) 


def _levenshtein_thresh(A, B, thresh, insertion, deletion, substitution): 
    D = _dist_thresh(A, B, thresh, insertion, deletion, substitution) 
    if D != None: 
     return D[len(A)][len(B)] 

def levenshtein(A, B, thresh=None, insertion=1, deletion=1, substitution=1): 
    """ 
    Compute levenshtein distance between iterables A and B 
    """ 
    # basic checks 
    if len(A) == len(B) and A == B: 
     return 0  
    if len(B) > len(A): 
     (A, B) = (B, A) 
    if len(A) == 0: 
     return len(B) 
    if thresh: 
     if len(A) - len(B) > thresh: 
      return 
     return _levenshtein_thresh(A, B, thresh, insertion, deletion, 
                  substitution) 
    else: 
     return _levenshtein(A, B, insertion, deletion, substitution) 

def levenshtein_ids(A, B, insertion=1, deletion=1, substitution=1): 
    """ 
    Compute number of insertions deletions, and substitutions for an 
    optimal alignment. 
    There may be more than one, in which case we disfavor substitution. 
    """ 
    # basic checks 
    if len(A) == len(B) and A == B: 
     return (0, 0, 0) 
    if len(B) > len(A): 
     (A, B) = (B, A) 
    if len(A) == 0: 
     return len(B) 
    else: 
     return _levenshtein_ids(A, B, insertion, deletion, substitution) 
0

看看這個庫:https://github.com/infoscout/weighted-levenshtein(免責聲明:我是作者)。它支持加權Levenshtein距離,加權最佳絃線對齊和加權Damerau-Levenshtein距離。它是用Cython編寫的,以獲得最佳性能,並且可以通過pip install weighted-levenshtein輕鬆安裝。反饋和拉請求是受歡迎的。

用法示例:

import numpy as np 
from weighted_levenshtein import lev 


insert_costs = np.ones(128, dtype=np.float64) # make an array of all 1's of size 128, the number of ASCII characters 
insert_costs[ord('D')] = 1.5 # make inserting the character 'D' have cost 1.5 (instead of 1) 

# you can just specify the insertion costs 
# delete_costs and substitute_costs default to 1 for all characters if unspecified 
print lev('BANANAS', 'BANDANAS', insert_costs=insert_costs) # prints '1.5' 
+1

您好,感謝有助於StackOverflow的。如果包含這個答案,將會更好地符合[指導原則](http://stackoverflow.com/help/how-to-answer),例如,演示如何用圖書館解決問題的簡短代碼示例。乾杯! – nthall