Multiprocessing search without duplicating the index in memory

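I have to search a large table of scientific journal articles for certain specific articles, which are listed in a separate file. My approach is to build a search index from the large table with Whoosh and then search that index for each article from the separate file. This works, but it takes a very long time (~2 weeks). So I want to speed it up with multiprocessing, and that is where I am struggling.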

The essential part of my "simple" search without multiprocessing looks like this:

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

# dir_index and output_file are defined elsewhere
with open('AuthorArticles.txt', 'r', encoding='utf-8') as f:
    articles = f.read().splitlines()

fs = FileStorage(dir_index, supports_mmap=False)
ix = fs.open_index()
parser = QueryParser('full_text', ix.schema)
with ix.searcher() as srch:
    for article in articles:
        # do stuff with article
        q = parser.parse(article)
        res = srch.search(q, limit=None)
        if not res.is_empty():
            with open(output_file, 'a', encoding='utf-8') as target:
                for r in res:
                    target.write(r['full_text'])

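What I specifically want to achieve is that the index is loaded into memory once and multiple processes then access it and search for the articles. This is what I have tried so far: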

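from multiprocessing import Pool

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

# dir_index is defined elsewhere
with open('AuthorArticles.txt', 'r', encoding='utf-8') as f:
    articles = f.read().splitlines()

def search_index(chunk):
    # chunk is a list of article strings; each worker opens the index itself
    fs = FileStorage(dir_index, supports_mmap=True)
    ix = fs.open_index()
    parser = QueryParser('full_text', ix.schema)
    result = []
    with ix.searcher() as srch:
        for a in chunk:
            # do stuff with article
            q = parser.parse(a)
            res = srch.search(q, limit=None)
            if not res.is_empty():
                for r in res:
                    result.append(r['full_text'])
    return result

if __name__ == '__main__':
    # split the articles into lists of 100 so each task searches a whole chunk
    chunks = [articles[i:i + 100] for i in range(0, len(articles), 100)]
    with Pool(4) as p:
        results = p.map(search_index, chunks)
        print(results)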

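However, as far as I understand, this way every single process loads the index into memory (which will not work, because the index is quite large).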

Is there a relatively simple way to achieve what I need? Basically, all I want is to use all the computing power at hand to search the index.

2015-06-07 smint75

Could you give a short sample of 'AuthorArticles.txt' so that I can test?

You can use multiprocessing shared ctypes to share memory between processes. If you only need read access, you can pass lock=False to Value or Array.

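Maybe this answer can help you further: How to combine Pool.map with Array (shared memory) in Python multiprocessing?

2015-06-07 12:12:11 matino

If you "read" a file using mmap (with the access parameter set to ACCESS_READ), your operating system's virtual memory subsystem will make sure that only one copy is loaded into memory.

Since you initialize FileStorage with supports_mmap=True, it will use mmap, and your problem should already be solved. :-)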
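To show what the mmap approach looks like in isolation, here is a small sketch with the standard library only. It does not touch Whoosh; the temporary file, its contents and the helper name `count_in_file` are assumptions made for the example. Each worker maps the same file with access=mmap.ACCESS_READ, so the pages are backed by the file on disk and can be shared by the operating system instead of being read into each process separately.

```python
import mmap
import os
import tempfile
from multiprocessing import Pool


def count_in_file(args):
    # Map the file read-only and count non-overlapping-start occurrences of needle.
    path, needle = args
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                pos = mm.find(needle, pos + 1)
            return count


if __name__ == '__main__':
    # Hypothetical stand-in for a large index file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b'whoosh index whoosh data whoosh')
        path = tmp.name
    try:
        with Pool(2) as pool:
            print(pool.map(count_in_file, [(path, b'whoosh'), (path, b'data')]))
    finally:
        os.remove(path)
```

The access keyword is accepted on both Unix and Windows, which is why the sketch uses it instead of the Unix-only prot and flags arguments.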

2015-06-07 12:30:21

Great! Thanks a lot for your help, I will try it. Does this work on every operating system? – smint75

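As you can see in the documentation, the parameters of mmap differ between POSIX platforms and MS Windows. If 'FileStorage' takes this into account, it should be fine. Check the source code.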
