2013-04-03 42 views
1

python中是否存在一些考慮口音的編輯距離。 凡爲例持有以下屬性編輯與口音的距離

d('ab', 'ac') > d('àb', 'ab') > 0 
+1

將不僅僅與非更換重音字母在兩個字符串中重讀,然後計算距離工作? – Dogbert

+0

我第二。使用Unidecode可能有所幫助:https://pypi.python.org/pypi/Unidecode/0.04.1 –

+0

好的謝謝,但在這一點上,我有d('àa','aa')= 0。 – vigte

回答

5

隨着Levenshtein module

In [1]: import unicodedata, string 

In [2]: from Levenshtein import distance 

In [3]: def remove_accents(data): 
    ...:  return ''.join(x for x in unicodedata.normalize('NFKD', data) 
    ...:        if x in string.ascii_letters).lower() 

In [4]: def norm_dist(s1, s2): 
    ...:  norm1, norm2 = remove_accents(s1), remove_accents(s2) 
    ...:  d1, d2 = distance(s1, s2), distance(norm1, norm2) 
    ...:  return (d1+d2)/2. 

In [5]: norm_dist(u'ab', u'ac') 
Out[5]: 1.0 

In [6]: norm_dist(u'àb', u'ab') 
Out[6]: 0.5 
2

Unicode允許的重音字符到基座字符加一個組合重音字符分解;例如à分解成a後跟一個組合的重音。

要使用標準化形式NFKD轉換這兩個字符串,NFKD分解重音字符並將兼容性字符轉換爲其規範形式,然後使用編輯距離度量標準對插入和刪除之上的替換進行排名。

1

下面是基於difflibunicodedata的解決方案,沒有任何依賴性:

import unicodedata 
from difflib import Differ 

# function taken from https://stackoverflow.com/a/517974/1222951 
def remove_accents(input_str): 
    nfkd_form = unicodedata.normalize('NFKD', input_str) 
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode() 
    return only_ascii 

def compare(wrong, right): 
    # normalize both strings to make sure equivalent (but 
    # different) unicode characters are canonicalized 
    wrong = unicodedata.normalize('NFKC', wrong) 
    right = unicodedata.normalize('NFKC', right) 

    num_diffs = 0 
    index = 0 
    differences = list(Differ().compare(wrong, right)) 
    while True: 
     try: 
      diff = differences[index] 
     except IndexError: 
      break 

     # diff is a string like "+ a" (meaning the character "a" was inserted) 
     # extract the operation and the character 
     op = diff[0] 
     char = diff[-1] 

     # if the character isn't equal in both 
     # strings, increase the difference counter 
     if op != ' ': 
      num_diffs += 1 

     # if a character is wrong, there will be two operations: one 
     # "+" and one "-" operation 
     # we want to count this as a single mistake, not as two mistakes 
     if op in '+-': 
      try: 
       next_diff = differences[index+1] 
      except IndexError: 
       pass 
      else: 
       next_op = next_diff[0] 
       if next_op in '+-' and next_op != op: 
        # skip the next operation, we don't want to count 
        # it as another mistake 
        index += 1 

        # we know that the character is wrong, but 
        # how wrong is it? 
        # if the only difference is the accent, it's 
        # a minor mistake 
        next_char = next_diff[-1] 
        if remove_accents(char) == remove_accents(next_char): 
         num_diffs -= 0.5 

     index += 1 

    # output the difference as a ratio of 
    # (# of wrong characters)/(length of longest input string) 
    return num_diffs/max(len(wrong), len(right)) 

測試:

for w, r in (('ab','ac'), 
      ('àb','ab'), 
      ('être','etre'), 
      ('très','trés'), 
      ): 
    print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100)) 
"ab" and "ac": 50.0% difference 
"àb" and "ab": 25.0% difference 
"être" and "etre": 12.5% difference 
"très" and "trés": 12.5% difference