比較python中的兩個大列表

我有一個包含大約400個單詞的列表。還有另一個列表，其中每個列表包含大約150,000個單詞。這個清單有20個這樣的清單。比較python中的兩個大列表

現在我想看看這1500個單詞列表中所有這400個單詞中有多少單詞出現。我也想從這400個單詞中知道一個單詞，在150k單詞列表中出現多少次，其中哪些單詞出現次數最多，次數多少等。

唯一的解決方案我能想到的是多項式時間解決方案。這是一個非常糟糕的解決方案，將是地獄很多慢：

for one_list in list_of_150kwords: 
    for key in 400_words: 
     for word in one_list: 
      if key == word: 
       # count this word 
       # do other stuff

這是一個非常醜陋和壞的解決方案，但我想不出什麼更好的。我試圖通過將這些列表轉換成NumPy數組來嘗試：

list_of_150kwords = numpy.array(list_of_150kwords) 
...

但我仍然覺得它很慢。其他解決方案？或者任何圖書館？

來源

2014-02-15 avi

這聽起來像使用set的好機會：

set_of_150kwords = set(list_of_150kwords) 
one_set = set(one_list) 

len(one_set & set_of_150kwords) # set intersection is & 
=> number of elements common to both sets

按集理論，兩個集合的交集給出了出現在都集合中的元素，那麼它是採取一個簡單的問題它的長度。對於第二部分（這些詞中哪些最常出現，多少次等）創建一個Counter與list_of_150kwords，這將告訴你每個單詞出現在列表中的次數。交集將告訴你哪些是常用詞，解決你的兩個要求。

來源

2014-02-15 18:17:11

哦，我沒試過集。他們比NumPy更快嗎？讓我跑步，看看 – avi

我相信'set'和'Counter'是這裏工作的正確工具，不僅僅是'numpy'數組。 –

但是我如何計算'one_list'中的單詞出現在'set_of_150kwords'多少次？ – avi

from collections import Counter 

search_data = [ 
    ["list", "of", "150k", "words"], 
    ["another", "list", "of", "150k", "words"], 
    ["yet", "another", "list", "of", "150k", "words"] 
    # ... 17 more of these 
] 

search_words = ["four", "hundred", "words", "to", "search", "for"] 

def word_finder(words_to_find): 
    lookfor = set(word.lower() for word in words_to_find) 
    def get_word_count(text): 
     return Counter(word for word in (wd.lower() for wd in text) if word in lookfor) 
    return get_word_count 

def get_words_in_common(counters): 
    # Maybe use c.viewkeys() instead of set(c)? Which is faster? 
    return reduce(operator.and_, (set(c) for c in counters)) 

def main(): 
    wordcount = word_finder(search_words) 
    counters = [wordcount(wordlst) for wordlst in search_data] 
    common_to_all = get_words_in_common(counters) 
    print(common_to_all) 

if __name__=="__main__": 
    main()

來源

2014-02-15 20:57:19

「也許使用c.viewkeys（）而不是set（c）？哪個更快？「< - 不要懷疑，也不在乎，如果代碼運行速度太慢，配置文件，如果這是緩慢的部分，* test *。在此之前，編寫儘可能*可讀*的代碼。 –

這是一個Trie將是有用的地方的典型例子。你需要爲你的每個150K列表創建一個Trie。然後你可以檢查給定的單詞是否存在於O（W）時間的列表中。其中W是單詞的最大長度。

然後，您可以遍歷400個單詞列表並檢查每個作品是否在150K字詞列表中。

假設L即150K列表的數量遠小於150K並且W遠小於150K，那麼集合連接將永遠不會像Trie比較那樣快。

的最終運行的複雜性：

N = 400 //Size of small list 
W = 10 // Max word Length 
M = 150K // Max size of the 150K lists 
P = 4 // Number of 150K lists 

P * M // To construct Trie 
N * P * W // To find each word in the 150k lists 
MP + NPW // Total complexit

來源

2014-02-16 21:47:31

經典的Map Reduce問題.... http://sist.sysu.edu.cn/~wuweig/DSCC/Inverted%20Indexing%20by%20MapReduce.pdf

來源

2014-02-17 07:22:24 roshan