2010-07-10 85 views
1

在Python的標準庫中發現difflib.SequenceMatcher類不適合我的需求後,編寫了一個通用「差異」模塊來解決問題空間。經過幾個月的時間思考更多關於它在做什麼之後,遞歸算法似乎在搜索更多的需求,通過重新搜索一個單獨的「搜索線程」也可能檢查過的序列中的相同區域。如何將memoization應用於此算法?

diff模塊的目的是計算一對序列(列表,元組,字符串,字節,bytearray等)之間的差異和相似性。最初的版本比代碼的當前版本慢得多,看到速度增加了十倍。如何將memoization應用於以下代碼?重寫算法以進一步提高速度的最佳方法是什麼?


class Slice: 

    __slots__ = 'prefix', 'root', 'suffix' 

    def __init__(self, prefix, root, suffix): 
     self.prefix = prefix 
     self.root = root 
     self.suffix = suffix 

################################################################################ 

class Match: 

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value' 

    def __init__(self, a, b, prefix, suffix, value): 
     self.a = a 
     self.b = b 
     self.prefix = prefix 
     self.suffix = suffix 
     self.value = value 

################################################################################ 

class Tree: 

    __slots__ = 'nodes', 'index', 'value' 

    def __init__(self, nodes, index, value): 
     self.nodes = nodes 
     self.index = index 
     self.value = value 

################################################################################ 

def search(a, b): 
    # Initialize startup variables. 
    nodes, index = [], [] 
    a_size, b_size = len(a), len(b) 
    # Begin to slice the sequences. 
    for size in range(min(a_size, b_size), 0, -1): 
     for a_addr in range(a_size - size + 1): 
      # Slice "a" at address and end. 
      a_term = a_addr + size 
      a_root = a[a_addr:a_term] 
      for b_addr in range(b_size - size + 1): 
       # Slice "b" at address and end. 
       b_term = b_addr + size 
       b_root = b[b_addr:b_term] 
       # Find out if slices are equal. 
       if a_root == b_root: 
        # Create prefix tree to search. 
        a_pref, b_pref = a[:a_addr], b[:b_addr] 
        p_tree = search(a_pref, b_pref) 
        # Create suffix tree to search. 
        a_suff, b_suff = a[a_term:], b[b_term:] 
        s_tree = search(a_suff, b_suff) 
        # Make completed slice objects. 
        a_slic = Slice(a_pref, a_root, a_suff) 
        b_slic = Slice(b_pref, b_root, b_suff) 
        # Finish the match calculation. 
        value = size + p_tree.value + s_tree.value 
        match = Match(a_slic, b_slic, p_tree, s_tree, value) 
        # Append results to tree lists. 
        nodes.append(match) 
        index.append(value) 
     # Return largest matches found. 
     if nodes: 
      return Tree(nodes, index, max(index)) 
    # Give caller null tree object. 
    return Tree(nodes, index, 0) 

參考:How to optimize a recursive algorithm to not repeat itself?

回答

1

正如〜unutbu說,嘗試memoized decorator及以下變化:

@memoized 
def search(a, b): 
    # Initialize startup variables. 
    nodes, index = [], [] 
    a_size, b_size = len(a), len(b) 
    # Begin to slice the sequences. 
    for size in range(min(a_size, b_size), 0, -1): 
     for a_addr in range(a_size - size + 1): 
      # Slice "a" at address and end. 
      a_term = a_addr + size 
      a_root = list(a)[a_addr:a_term] #change to list 
      for b_addr in range(b_size - size + 1): 
       # Slice "b" at address and end. 
       b_term = b_addr + size 
       b_root = list(b)[b_addr:b_term] #change to list 
       # Find out if slices are equal. 
       if a_root == b_root: 
        # Create prefix tree to search. 
        a_pref, b_pref = list(a)[:a_addr], list(b)[:b_addr] 
        p_tree = search(a_pref, b_pref) 
        # Create suffix tree to search. 
        a_suff, b_suff = list(a)[a_term:], list(b)[b_term:] 
        s_tree = search(a_suff, b_suff) 
        # Make completed slice objects. 
        a_slic = Slice(a_pref, a_root, a_suff) 
        b_slic = Slice(b_pref, b_root, b_suff) 
        # Finish the match calculation. 
        value = size + p_tree.value + s_tree.value 
        match = Match(a_slic, b_slic, p_tree, s_tree, value) 
        # Append results to tree lists. 
        nodes.append(match) 
        index.append(value) 
     # Return largest matches found. 
     if nodes: 
      return Tree(nodes, index, max(index)) 
    # Give caller null tree object. 
    return Tree(nodes, index, 0) 

對於記憶化,字典是最好的,但它們不能被切成薄片,所以他們在評論指出改爲名單以上。

+0

感謝您的幫助!如上所述,該代碼是爲支持切片的序列(列表,元組,字符串,字節,字節陣列和任何定製容器)編寫的。從你的改變看來,正確記憶需要不可變序列。 – 2010-07-11 21:20:51

1

你可以從Python Decorator Library 使用memoize的裝飾,並使用它像這樣:

@memoized 
def search(a, b): 

您第一時間致電search與自變量a,b,結果計算和memoized(保存在緩存中)。第二次使用相同的參數調用search,結果從緩存中返回。

請注意,對於memoized裝飾工作,參數必須是可散列的。如果ab是數字元組,那麼它們是可散列的。如果它們是列表,那麼在將它們傳遞給search之前,可以將它們轉換爲元組。 它看起來不像search需要dicts作爲參數,但如果它們是,則they would not be hashable和memoization裝飾器將無法將結果保存在緩存中。

+0

你的答案可能不是最好的,因爲他的算法使用列表切片...算法本身可能需要重構。 – kzh 2010-07-10 21:00:37

+0

@kzh:你能詳細說明一下嗎?我很抱歉 - 我沒有跟着你。如果列表切片出了什麼問題? – unutbu 2010-07-10 21:25:18

+0

他需要將字典更改回切片部分的列表。例如,我發佈了我的答案。 – kzh 2010-07-10 21:29:10

相關問題