Most efficient way to index words in a document?

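This came up in another question, but I figured it is best asked as a separate question. Given a large list of sentences (on the order of hundreds of thousands):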

[ 
"This is sentence 1 as an example", 
"This is sentence 1 as another example", 
"This is sentence 2", 
"This is sentence 3 as another example ", 
"This is sentence 4" 
]

What is the best way to write the following function?

def GetSentences(word1, word2, position): 
    return ""

where, given two words, word1 and word2, and a position, the function should return the list of all sentences satisfying that constraint. For example:

GetSentences("sentence", "another", 3)

should return sentences 1 and 3, given as sentence indices. My current approach uses a dictionary like this:

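from collections import defaultdict

# Index[word1][word2][distance] -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # pair each word with every later word; i is the distance between them
        for i, word2 in enumerate(words[index + 1:], start=1):
            Index[word][word2][i].append(sentenceIndex)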

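But this quickly blows everything out of proportion: on a 130 MB dataset, my 48 GB of RAM is exhausted in less than 5 minutes. I somehow get the feeling that this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?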

2011-11-05 Legend

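Just to clarify: is 'position' the distance between the two words in the sentence? – misha

@misha: Yes. That is correct. – Legend

Having two "sentence 1"s is confusing. Is it supposed to match the second "1" and not the first? – shookster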

Use a database to store the values.

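First, add all the sentences to one table (they should have IDs). You could call it, e.g., sentences.
Second, create a table of the words contained in all the sentences (call it, e.g., words, and give each word an ID), and store the links between sentence records and word records in a separate table (call it, e.g., sentences_words; it should have two columns, preferably word_id and sentence_id).
When searching for sentences that contain all the given words, your job is then simplified:
1. First, find the records in the words table whose words are exactly the ones you are searching for. The query could look like this: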
```
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3'); 
```
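2. Second, find the sentence_id values in the sentences_words table that have the required word_id values (the ones matching the words found in the words table). The initial query could look like this: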
```
SELECT `sentence_id`, `word_id` FROM `sentences_words` 
WHERE `word_id` IN ([here goes list of words' ids]); 
```
  This can be simplified to:
```
SELECT `sentence_id`, `word_id` FROM `sentences_words` 
WHERE `word_id` IN (
    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3') 
); 
```
3. Filter the result in Python so that only the sentence_id values that have all the required word_id values are returned.
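
The filtering in step 3 could be sketched roughly as follows. This is a minimal illustration, not the answer's own code: the function name sentences_with_all_words and the sample rows are hypothetical, and the rows are assumed to be (sentence_id, word_id) pairs as returned by the query above.

```python
from collections import defaultdict

def sentences_with_all_words(rows, required_word_ids):
    """Return the sentence ids whose rows cover every required word id.

    rows: iterable of (sentence_id, word_id) pairs (assumed query result).
    """
    required = set(required_word_ids)
    seen = defaultdict(set)  # sentence_id -> required word ids seen so far
    for sentence_id, word_id in rows:
        if word_id in required:
            seen[sentence_id].add(word_id)
    # keep only sentences in which every required word id showed up
    return sorted(sid for sid, ids in seen.items() if ids == required)

# Hypothetical query result: sentence 2 contains only word 10
rows = [(1, 10), (1, 11), (2, 10), (3, 11), (3, 10), (3, 10)]
print(sentences_with_all_words(rows, [10, 11]))  # prints [1, 3]
```

Grouping by sentence_id in a dictionary keeps this a single pass over the rows, regardless of how many words are requested.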

This is basically a solution built on storing a large amount of data in the form best suited to it: a database.

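Edit:

If you will only ever search for two words, you can do even more (almost everything) on the DBMS side.
Since you also need the position difference, store each word's position in a third column of the sentences_words table (let's call it position), and when searching for the matching words, compute the difference between the position values associated with the two words.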
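
One way the whole scheme, including the position column, might look with Python's built-in sqlite3 module is sketched below. Several details are assumptions made for illustration only: the in-memory database, the index name idx_sw_word, the function name get_sentences, 0-based sentence IDs, and splitting on whitespace. The position difference is computed by joining sentences_words to itself.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")  # assumption: in-memory DB for the demo
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER, word_id INTEGER, position INTEGER
    );
    -- hypothetical index so lookups by word do not scan the whole table
    CREATE INDEX idx_sw_word
        ON sentences_words (word_id, sentence_id, position);
""")

for sentence_id, sentence in enumerate(sentences):
    conn.execute("INSERT INTO sentences VALUES (?, ?)", (sentence_id, sentence))
    for position, word in enumerate(sentence.split()):
        # UNIQUE + OR IGNORE: each distinct word is stored only once
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO sentences_words VALUES (?, ?, ?)",
            (sentence_id, word_id, position),
        )
conn.commit()

def get_sentences(word1, word2, distance):
    """Sentence ids in which word2 occurs `distance` words after word1."""
    query = """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id
         AND b.position = a.position + ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
    """
    return [row[0] for row in conn.execute(query, (distance, word1, word2))]

print(get_sentences("sentence", "another", 3))  # prints [1, 3]
```

Storage here grows with the number of word occurrences rather than with the number of word pairs, which is what made the nested dictionary in the question run out of memory.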

2011-11-05 01:20:02 Tadeck

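+1 Thank you very much for your time. I think I will go with this. I am thinking of using SQLite for the moment, and will switch to MySQL if that does not work out. – Legend

@Legend: Thanks. I believe sqlite is a great fit for this as long as the database will not be used by multiple users at the same time. If only one user will use it, sqlite is the best option in my opinion, so I totally agree with your choice. – Tadeck

I came back to thank you again. There is a lot to be said for "use the right tool for the right job" :) The time for building the collocations has gone down from X hours (X > 12, and it never completed because it ran out of memory!) to 1 hour now with sqlite, and it is not even heavy! – Legend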

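Here is how I would do it in Python. That said, assuming this needs to be done more than once, a DBMS is the right tool for the job. Still, this seems to work pretty well for me with a million rows.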

sentences = [ 
    "This is sentence 1 as an example", 
    "This is sentence 1 as another example", 
    "This is sentence 2", 
    "This is sentence 3 as another example ", 
    "This is sentence 4" 
    ] 

sentences = sentences * 200 * 1000 

sentencesProcessed = [] 

def preprocess(): 
    global sentences 
    global sentencesProcessed 
    # split() with no argument splits on any run of whitespace
    sentencesProcessed = [sentence.split() for sentence in sentences]

    # can deallocate sentences now 
    sentences = None 


def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        # only look at words that have a neighbour `position` places to the right
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__": 
    main()
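
The scan above visits every sentence on each query. If many queries are needed and a database is not an option, one possible middle ground is a positional inverted index that stores one entry per word occurrence instead of one per word pair. The sketch below is an assumption for illustration, not part of the original answer; build_index and get_sentences_indexed are hypothetical names.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each word to {sentence_index: [positions of that word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for sentence_index, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            index[word][sentence_index].append(position)
    return index

def get_sentences_indexed(index, word1, word2, distance):
    """Sentence indices where word2 occurs `distance` words after word1."""
    first = index.get(word1, {})
    second = index.get(word2, {})
    results = []
    # only sentences that contain both words can possibly match
    for sentence_index in first.keys() & second.keys():
        second_positions = set(second[sentence_index])
        if any(p + distance in second_positions for p in first[sentence_index]):
            results.append(sentence_index)
    return sorted(results)

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
index = build_index(sentences)
print(get_sentences_indexed(index, "sentence", "another", 3))  # prints [1, 3]
```

Each query then touches only the sentences that contain both words, at the cost of holding the index in memory.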

2011-11-05 02:06:28 shookster

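+1 Thank you for this approach. I actually tested it and found it to be very fast for a one-off query. However, when I tried running multiple queries the lookup time was too high, which is expected because there is no index. Still, it is an interesting approach without a doubt. Thanks. – Legend

@Legend: Yes, it looks through the whole dataset on every query. I just wanted to give it a try :-) – shookster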
