找到最好的子集，以匹配給定的字符串

我有一個字符串找到最好的子集，以匹配給定的字符串

s = "mouse"

和

sub_strings = ["m", "o", "se", "e"]

我需要找出什麼是最好的和最短匹配的字符串列表sub_strings子集與s匹配的列表。這樣做的最好方法是什麼？理想的結果將是[「M」，「O」，「SE」]，因爲他們一起拼莫斯

來源

2012-09-12 smor

這裏的正確答案可能取決於字符串長度與子字符串數量的比較。該字符串在實踐中需要多長時間？你有多少個子串？它需要超快嗎？ –

「最好和最短」是什麼意思？ –

字符串可以儘可能大，子字符串也會很大，但所有子字符串都必須是字符串的一部分。 – smor

您可以使用正則表達式：

import re 

def matches(s, sub_strings): 
    sub_strings = sorted(sub_strings, key=len, reverse=True) 
    pattern = '|'.join(re.escape(substr) for substr in sub_strings) 
    return re.findall(pattern, s)

這至少是短而快，但它並不一定能找到匹配的最佳設置;它太貪婪了。例如，

matches("bears", ["bea", "be", "ars"])

回報["bea"]，當它應該返回["be", "ars"]。

解釋代碼：

第一行按長度排序的子串，使最長的字符串出現在列表的開始。這可以確保正則表達式更喜歡較長的匹配。
第二行創建一個由所有子字符串組成的正則表達式模式，由|符號分隔，表示「或」。
第三行僅使用re.findall函數查找給定字符串s中的模式的所有匹配項。

來源

2012-09-12 18:36:00

很好的答案:)。 –

嗯。爲什麼是-1？ –

它的-1 bot或someat ...他只是-1兩個我們的答案沒有評論... –

import difflib 
print difflib.get_close_matches(target_word,list_of_possibles)

但不幸的是它不會對你上面的例子工作，你可以使用編輯距離，而不是.. 。

def levenshtein_distance(first, second): 
    """Find the Levenshtein distance between two strings.""" 
    if len(first) > len(second): 
     first, second = second, first 
    if len(second) == 0: 
     return len(first) 
    first_length = len(first) + 1 
    second_length = len(second) + 1 
    distance_matrix = [[0] * second_length for x in range(first_length)] 
    for i in range(first_length): 
     distance_matrix[i][0] = i 
    for j in range(second_length): 
     distance_matrix[0][j]=j 
    for i in xrange(1, first_length): 
     for j in range(1, second_length): 
      deletion = distance_matrix[i-1][j] + 1 
      insertion = distance_matrix[i][j-1] + 1 
      substitution = distance_matrix[i-1][j-1] 
      if first[i-1] != second[j-1]: 
       substitution += 1 
      distance_matrix[i][j] = min(insertion, deletion, substitution) 
    return distance_matrix[first_length-1][second_length-1] 

sub_strings = ["mo", "m,", "o", "se", "e"] 
s="mouse" 
print sorted(sub_strings,key = lambda x:levenshtein_distance(x,s))[0]

這將永遠給你的「最接近」二字來你的目標（爲最近的一些定義）

萊文斯坦功能被盜：HTTP：//www.korokithakis.net/posts/finding -levenshtein-distance-in-python/

來源

2012-09-12 18:22:21

非常感謝你的努力。這是一個好的開始。我可能沒有正確描述它，但讓我們用另一個例子。假設子字符串只是sub_strings = [「m」，「o」，「se」，「e」]，那麼我期望組合[「m」，「o」，「se」]是結果，因爲他們在一起拼出最接近鼠標組合的mose！我會更新我的示例 – smor

然後se應該是最接近的......哦，我看到你的行爲......不知道......從這開始，從那裏開始工作，你可能需要'itertools.combinations'（或者可能是排列組合。 ..我經常混淆兩個）...它可能會很慢...非常慢... –

爲什麼-1？他編輯了這個問題，說他想要多場比賽...... –

此解決方案基於this answer用戶Running Wild。它使用Stefan Behnel的the acora package使用Aho–Corasick algorithm高效地找到目標中所有子串的匹配項，然後使用dynamic programming找到答案。

import acora 
import collections 

def best_match(target, substrings): 
    """ 
    Find the best way to cover the string `target` by non-overlapping 
    matches with strings taken from `substrings`. Return the best 
    match as a list of substrings in order. (The best match is one 
    that covers the largest number of characters in `target`, and 
    among all such matches, the one using the fewest substrings.) 

    >>> best_match('mouse', ['mo', 'ou', 'us', 'se']) 
    ['mo', 'us'] 
    >>> best_match('aaaaaaa', ['aa', 'aaa']) 
    ['aaa', 'aa', 'aa'] 
    >>> best_match('abracadabra', ['bra', 'cad', 'dab']) 
    ['bra', 'cad', 'bra'] 
    """ 
    # Find all occurrences of the substrings in target and store them 
    # in a dictionary by their position. 
    ac = acora.AcoraBuilder(*substrings).build() 
    matches = collections.defaultdict(set) 
    for match, pos in ac.finditer(target): 
     matches[pos].add(match) 

    n = len(target) 
    # Array giving the best (score, list of matches) found so far, for 
    # each initial substring of the target. 
    best = [(0, []) for _ in xrange(n + 1)] 
    for i in xrange(n): 
     bi = best[i] 
     bj = best[i + 1] 
     if bi[0] > bj[0] or bi[0] == bj[0] and len(bi[1]) < bj[1]: 
      best[i + 1] = bi 
     for m in matches[i]: 
      j = i + len(m) 
      bj = best[j] 
      score = bi[0] + len(m) 
      if score > bj[0] or score == bj[0] and len(bi[1]) < len(bj[1]): 
       best[j] = (score, bi[1] + [m]) 
    return best[n][1]

來源

2012-09-13 10:59:49

找到最好的子集，以匹配給定的字符串

回答

相關問題