Levenstein距離子串

有沒有一種很好的方法來使用levenstein距離來匹配一個特定的字符串到第二個更長的字符串中的任何區域？Levenstein距離子串

實施例：

str1='aaaaa' 
str2='bbbbbbaabaabbbb' 

if str1 in str2 with a distance < 2: 
    return True

所以在串2的上述例子中的部分是aabaa和distance(str1,str2) < 2所以該語句應該返回True。

我認爲這樣做的唯一方法是每次從str2中取5個字符，與str1進行比較，然後在str2中重複此操作。不幸的是，這看起來效率很低，我需要用這種方式處理大量的數據。

來源

2017-06-06 The Nightman

https://pypi.python.org/pypi/python-Levenshtein/ –

編輯距離只有5 lenght蘇「str2」的所有字符串（例如。兩個較短的，4個字符和更長的6個字符，其可能在1）的Levenstein距離處）？ –

@ Mr.Xcoder這是我的想法，但我需要處理大約10GB的每一行文件，我認爲這會很慢。 –

訣竅是生成適當長度的所有子字符串b，然後比較每一個。

def lev_dist(a,b): 
    length_cost = abs(len(a) - len(b)) 
    diff_cost = sum(1 for (aa, bb) in zip(a,b) if aa != bb) 
    return diff_cost + length_cost 

def all_substr_of_length(n, s): 
    if n > len(s): 
     return [s] 
    else: 
     return [s[i:i+n] for i in range(0, len(s)-n+1)] 

def lev_substr(a, b): 
    """Gives minimum lev distance of all substrings of b and 
    the single string a. 
    """ 

    return min(lev_dist(a, bb) for bb in all_substr_of_length(len(a), b)) 

if lev_substr(str1, str2) < 2: 
    # it works!

來源

2017-06-06 19:31:44

，只是爲了好玩我在Haskell [here]中實現了這個（https://gitlab.com/snippets/1664166） –

訣竅通常是玩插入（用於更短）或刪除（用於更長的時間）成本。您也可以考慮使用Damerau-Levenshtein。 https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

來源

2017-06-06 19:14:43

你可能有一個看regex module支持模糊匹配：

>>> import regex 
>>> regex.search("(aaaaa){s<2}", 'bbbbbbaabaabbbb') 
<regex.Match object; span=(6, 11), match='aabaa', fuzzy_counts=(1, 0, 0)>

既然你正在尋找相等長度的字符串，你也可以做AA Hamming distance這很可能遠遠快於一在相同的兩個字符串編輯距離：

str1='aaaaa' 
str2='bbbbbbaabaabbbb' 
for s in [str2[i:i+len(str1)] for i in range(0,len(str2)-len(str1)+1)]: 
    if sum(a!=b for a,b in zip(str1,s))<2: 
     print s # prints 'aabaa'

來源

2017-06-06 19:42:15 dawg

Levenstein距離子串

回答

相關問題