如何將memoization應用於此算法？

在Python的標準庫中發現difflib.SequenceMatcher類不適合我的需求後，編寫了一個通用「差異」模塊來解決問題空間。經過幾個月的時間思考更多關於它在做什麼之後，遞歸算法似乎在搜索更多的需求，通過重新搜索一個單獨的「搜索線程」也可能檢查過的序列中的相同區域。如何將memoization應用於此算法？

diff模塊的目的是計算一對序列（列表，元組，字符串，字節，bytearray等）之間的差異和相似性。最初的版本比代碼的當前版本慢得多，看到速度增加了十倍。如何將memoization應用於以下代碼？重寫算法以進一步提高速度的最佳方法是什麼？

class Slice: 

    __slots__ = 'prefix', 'root', 'suffix' 

    def __init__(self, prefix, root, suffix): 
     self.prefix = prefix 
     self.root = root 
     self.suffix = suffix 

################################################################################ 

class Match: 

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value' 

    def __init__(self, a, b, prefix, suffix, value): 
     self.a = a 
     self.b = b 
     self.prefix = prefix 
     self.suffix = suffix 
     self.value = value 

################################################################################ 

class Tree: 

    __slots__ = 'nodes', 'index', 'value' 

    def __init__(self, nodes, index, value): 
     self.nodes = nodes 
     self.index = index 
     self.value = value 

################################################################################ 

def search(a, b): 
    # Initialize startup variables. 
    nodes, index = [], [] 
    a_size, b_size = len(a), len(b) 
    # Begin to slice the sequences. 
    for size in range(min(a_size, b_size), 0, -1): 
     for a_addr in range(a_size - size + 1): 
      # Slice "a" at address and end. 
      a_term = a_addr + size 
      a_root = a[a_addr:a_term] 
      for b_addr in range(b_size - size + 1): 
       # Slice "b" at address and end. 
       b_term = b_addr + size 
       b_root = b[b_addr:b_term] 
       # Find out if slices are equal. 
       if a_root == b_root: 
        # Create prefix tree to search. 
        a_pref, b_pref = a[:a_addr], b[:b_addr] 
        p_tree = search(a_pref, b_pref) 
        # Create suffix tree to search. 
        a_suff, b_suff = a[a_term:], b[b_term:] 
        s_tree = search(a_suff, b_suff) 
        # Make completed slice objects. 
        a_slic = Slice(a_pref, a_root, a_suff) 
        b_slic = Slice(b_pref, b_root, b_suff) 
        # Finish the match calculation. 
        value = size + p_tree.value + s_tree.value 
        match = Match(a_slic, b_slic, p_tree, s_tree, value) 
        # Append results to tree lists. 
        nodes.append(match) 
        index.append(value) 
     # Return largest matches found. 
     if nodes: 
      return Tree(nodes, index, max(index)) 
    # Give caller null tree object. 
    return Tree(nodes, index, 0)

參考：How to optimize a recursive algorithm to not repeat itself?

來源

2010-07-10 Noctis Skytower

正如〜unutbu說，嘗試memoized decorator及以下變化：

@memoized 
def search(a, b): 
    # Initialize startup variables. 
    nodes, index = [], [] 
    a_size, b_size = len(a), len(b) 
    # Begin to slice the sequences. 
    for size in range(min(a_size, b_size), 0, -1): 
     for a_addr in range(a_size - size + 1): 
      # Slice "a" at address and end. 
      a_term = a_addr + size 
      a_root = list(a)[a_addr:a_term] #change to list 
      for b_addr in range(b_size - size + 1): 
       # Slice "b" at address and end. 
       b_term = b_addr + size 
       b_root = list(b)[b_addr:b_term] #change to list 
       # Find out if slices are equal. 
       if a_root == b_root: 
        # Create prefix tree to search. 
        a_pref, b_pref = list(a)[:a_addr], list(b)[:b_addr] 
        p_tree = search(a_pref, b_pref) 
        # Create suffix tree to search. 
        a_suff, b_suff = list(a)[a_term:], list(b)[b_term:] 
        s_tree = search(a_suff, b_suff) 
        # Make completed slice objects. 
        a_slic = Slice(a_pref, a_root, a_suff) 
        b_slic = Slice(b_pref, b_root, b_suff) 
        # Finish the match calculation. 
        value = size + p_tree.value + s_tree.value 
        match = Match(a_slic, b_slic, p_tree, s_tree, value) 
        # Append results to tree lists. 
        nodes.append(match) 
        index.append(value) 
     # Return largest matches found. 
     if nodes: 
      return Tree(nodes, index, max(index)) 
    # Give caller null tree object. 
    return Tree(nodes, index, 0)

對於記憶化，字典是最好的，但它們不能被切成薄片，所以他們在評論指出改爲名單以上。

來源

2010-07-10 21:26:56 kzh

感謝您的幫助！如上所述，該代碼是爲支持切片的序列（列表，元組，字符串，字節，字節陣列和任何定製容器）編寫的。從你的改變看來，正確記憶需要不可變序列。 – 2010-07-11 21:20:51

你可以從Python Decorator Library 使用memoize的裝飾，並使用它像這樣：

@memoized 
def search(a, b):

您第一時間致電search與自變量a,b，結果計算和memoized（保存在緩存中）。第二次使用相同的參數調用search，結果從緩存中返回。

請注意，對於memoized裝飾工作，參數必須是可散列的。如果a和b是數字元組，那麼它們是可散列的。如果它們是列表，那麼在將它們傳遞給search之前，可以將它們轉換爲元組。它看起來不像search需要dicts作爲參數，但如果它們是，則they would not be hashable和memoization裝飾器將無法將結果保存在緩存中。

來源

2010-07-10 20:30:25 unutbu

你的答案可能不是最好的，因爲他的算法使用列表切片...算法本身可能需要重構。 – kzh 2010-07-10 21:00:37

@kzh：你能詳細說明一下嗎？我很抱歉 - 我沒有跟着你。如果列表切片出了什麼問題？ – unutbu 2010-07-10 21:25:18

他需要將字典更改回切片部分的列表。例如，我發佈了我的答案。 – kzh 2010-07-10 21:29:10

如何將memoization應用於此算法？

回答

相關問題