Fast serialization of a trie

Part of my application uses a trie to chunk words together. For example, ["Summer", "in", "Los", "Angeles"] becomes ["Summer", "in", "Los Angeles"].

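Right now, this trie is populated at application startup from a large database that is stored locally as SQL. This takes a long time, about 15 s. I would like to reduce the startup time, so I have looked into serializing the trie. Unfortunately, pickling is too slow: it is slower than loading everything from the database.

Is there a faster way to serialize my trie?

Here is what the Trie class looks like: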

class Trie:
    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        """Insert a (key,value) pair into the trie.
        The key should be a list of strings.
        The value can be of arbitrary type."""
        current_node = self
        for key_part in key:
            if key_part not in current_node.children:
                current_node.children[key_part] = Trie()
            current_node = current_node.children[key_part]
        current_node.values.add(value)

    def retrieve(self, key):
        """Returns either the value stored at the key, or raises KeyError."""
        current_node = self
        for key_part in key:
            current_node = current_node.children[key_part]
        return current_node.values

Is there any way to change it that would make it easier to serialize?
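
For context, the chunking described at the top could look roughly like the sketch below. The post does not show this code, so the `chunk` function and its greedy longest-match strategy are assumptions made for illustration; the `Trie` class is a trimmed copy of the one above.

```python
class Trie:
    """Trimmed copy of the Trie class from the question."""

    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        node = self
        for part in key:
            node = node.children.setdefault(part, Trie())
        node.values.add(value)


def chunk(trie, words):
    """Hypothetical helper: greedily join the longest run of words that is a key."""
    result = []
    i = 0
    while i < len(words):
        node = trie
        best_end = i + 1  # fall back to emitting a single word
        j = i
        while j < len(words) and words[j] in node.children:
            node = node.children[words[j]]
            j += 1
            if node.values:  # some key ends at this node
                best_end = j
        result.append(" ".join(words[i:best_end]))
        i = best_end
    return result


trie = Trie()
trie.insert(["Los", "Angeles"], "city")
chunks = chunk(trie, ["Summer", "in", "Los", "Angeles"])
```

Here `chunks` comes out as ["Summer", "in", "Los Angeles"], matching the example in the question.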

Source

2014-04-25 misha

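I once did this to save memory (http://stackoverflow.com/questions/2574357/how-to-transform-phrases-and-words-into-md5-hash), but with optimized databases such as MongoDB and indexing APIs like Lucene, I would avoid building a new structure for indexing and retrieval. – alvas

+1 for MongoDB; I was actually thinking of moving away from relational databases. – misha

I ended up storing the trie in MongoDB.

There is some network overhead, but as long as the database is on localhost it is not too bad.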
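
The answer does not say how the trie was laid out in MongoDB. One possible layout, shown here purely as an assumption, is one document per node, with child edges kept in a list so that arbitrary strings can serve as labels. The `to_documents` helper and the field names `values`, `children`, `label` and `node` are hypothetical.

```python
class Trie:
    """Trimmed copy of the Trie class from the question."""

    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        node = self
        for part in key:
            node = node.children.setdefault(part, Trie())
        node.values.add(value)


def to_documents(root):
    """Return one plain-dict document per trie node (hypothetical schema)."""
    docs = []
    stack = [(root, 0)]  # (node, id assigned to that node); the root gets id 0
    next_id = 1
    while stack:
        node, node_id = stack.pop()
        edges = []
        for label, child in node.children.items():
            edges.append({"label": label, "node": next_id})
            stack.append((child, next_id))
            next_id += 1
        docs.append({
            "_id": node_id,
            "values": list(node.values),  # sets are turned into lists for storage
            "children": edges,
        })
    return docs


trie = Trie()
trie.insert(["Los", "Angeles"], "city")
docs = to_documents(trie)
```

With pymongo, a list like `docs` could be passed to `Collection.insert_many`; that call is left out here because it needs a running server. The values must be of types the database can store.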

Source

2014-05-07 07:14:34 misha

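I know I am not giving a Python answer, but this might still be useful:

Creating, compressing and storing a trie is indeed a hard task. I spent quite some time thinking about data structures for auto-suggestion, and the best solution I know of is the one by Giuseppe Ottaviano, which is partly described in my blog article.

Even if it does not make sense to implement Ottaviano's whole solution, as described in his paper, in Python, you can still follow the basic idea: store the complete trie as one big chunk of memory and only record the positions of the next hops.

That way you can easily serialize this array or memory chunk to disk. I am not entirely sure about Python, but I think this should work and be much faster than serializing the data structure.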
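
In Python, the flat-layout idea might look roughly like the sketch below. This is not Ottaviano's structure, only a plain breadth-first layout written for illustration. The helpers `flatten` and `retrieve_flat` are hypothetical, values are assumed to be picklable, and whether loading this beats the database load would need to be measured.

```python
import pickle
from array import array
from collections import deque


class Trie:
    """Trimmed copy of the Trie class from the question."""

    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        node = self
        for part in key:
            node = node.children.setdefault(part, Trie())
        node.values.add(value)


def flatten(root):
    """Lay the trie out breadth-first as flat sequences of next-hop positions."""
    labels = []              # labels[e] = label of edge e
    targets = array("I")     # targets[e] = index of the node edge e leads to
    first_edge = array("I")  # edges of node i: first_edge[i] .. first_edge[i + 1] - 1
    values = []              # values[i] = list of values stored at node i
    queue = deque([root])
    next_index = 1           # index the next discovered child will receive
    while queue:
        node = queue.popleft()
        first_edge.append(len(targets))
        values.append(list(node.values))
        for label, child in node.children.items():
            labels.append(label)
            targets.append(next_index)
            next_index += 1
            queue.append(child)
    first_edge.append(len(targets))  # sentinel closing the last node's edge range
    return labels, targets, first_edge, values


def retrieve_flat(flat, key):
    """Look up a key directly in the flat layout, raising KeyError if absent."""
    labels, targets, first_edge, values = flat
    node = 0  # the root is always node 0
    for part in key:
        for e in range(first_edge[node], first_edge[node + 1]):
            if labels[e] == part:
                node = targets[e]
                break
        else:
            raise KeyError(part)
    return set(values[node])


trie = Trie()
trie.insert(["Los", "Angeles"], "city")
trie.insert(["Los"], "article")
trie.insert(["Summer"], "season")

# Four flat containers go through pickle instead of a deep graph of objects.
blob = pickle.dumps(flatten(trie), protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
```

Lookups can run on the restored arrays without rebuilding any `Trie` objects. The linear scan over a node's edges keeps the sketch short; keeping each node's edges sorted by label would allow binary search instead.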

I know that a C implementation of Ottaviano's work exists, and you could even use it through Python C bindings.

Source

2014-04-25 14:11:58
