在一個巨大的列表中查找/搜索的最有效方式（python）

- 我剛剛解析了一個大文件，並創建了一個包含42.000個字符串/字的列表。我想查詢[對這個列表]來檢查給定的單詞/字符串是否屬於它。所以我的問題是：在一個巨大的列表中查找/搜索的最有效方式（python）

這種查找最有效的方法是什麼？

第一種方法是對列表進行排序（list.sort()），然後只用

>> if word in list: print 'word'

這實在是微不足道的，我相信有一個更好的辦法做到這一點。我的目標是應用快速查找，查找給定字符串是否在此列表中。如果您對其他數據結構有任何想法，歡迎提供。然而，現在我想避免像Tries等更復雜的數據結構。我有興趣聽到關於快速查找的想法（或技巧），或者可能比簡單的in更快地執行搜索的任何其他Python庫方法。

，也是我想知道搜索項

來源

2010-04-23 user229269

不要創建list，創建set的索引。它在不斷的時間內進行查找。

如果您不想要某個集合的內存開銷，請保留一個排序列表並使用bisect模塊搜索它。

from bisect import bisect_left 
def bi_contains(lst, item): 
    """ efficient `item in lst` for sorted lists """ 
    # if item is larger than the last its not in the list, but the bisect would 
    # find `len(lst)` as the index to insert, so check that first. Else, if the 
    # item is in the list then it has to be at index bisect_left(lst, item) 
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)

來源

2010-04-23 18:51:30

感謝您詳細的答覆很多THC4k。其實我正在考慮自己申請一個二進制搜索，但正如我所看到的那樣，無論如何這是對分模塊所做的，所以你節省了我的時間:)。再次感謝您的幫助。 – user229269 2010-04-23 20:31:46

@ user229269，你鎖定在帖子的錯誤部分！你可能想要一個'set'，而不是'list'。 – 2010-04-23 21:34:41

@Mike Graham我知道你在說什麼，但是如果我使用集合，恐怕我可能會遇到內存問題，因爲我的列表實際上是一個快速增長的單詞列表，它最終會達到100.000個字符串和更多 – user229269 2010-04-23 22:13:56

套左右相比列表尚未考慮到的一點：在「解析大文件」人們將期望需要處理複製字/串。你還沒有提到這一點。

很明顯，將新單詞添加到一個集合中可以快速刪除重複內容，而無需額外花費CPU時間或思考時間。如果您嘗試使用列表，它會以O（N ** 2）結尾。如果你把所有東西都附加到一個列表中，並在最後刪除重複項，那麼最明智的做法是......鼓輪...使用一個集合，並且列表的（小）內存優勢很可能會被重複。

來源

2010-04-23 23:21:46

-2

如果您稍後預計複雜的查找 - 複雜的我的意思是不平凡 - 我建議您將它存儲在sqlite3。

來源

2010-04-24 09:02:33

使用這個程序，它看起來像類型的字典是fastes，設置第二，名單與bi_contains第三：

from datetime import datetime 

def ReadWordList(): 
    """ Loop through each line in english.txt and add it to the list in uppercase. 

    Returns: 
    Returns array with all the words in english.txt. 

    """ 
    l_words = [] 
    with open(r'c:\english.txt', 'r') as f_in: 
     for line in f_in: 
      line = line.strip().upper() 
      l_words.append(line) 

    return l_words 

# Loop through each line in english.txt and add it to the l_words list in uppercase. 
l_words = ReadWordList() 
l_words = {key: None for key in l_words} 
#l_words = set(l_words) 
#l_words = tuple(l_words) 

t1 = datetime.now() 

for i in range(10000): 
    #w = 'ZEBRA' in l_words 
    w = bi_contains(l_words, 'ZEBRA') 

t2 = datetime.now() 
print('After: ' + str(t2 - t1)) 

# list = 41.025293 seconds 
# dict = 0.001488 seconds 
# set = 0.001499 seconds 
# tuple = 38.975805 seconds 
# list with bi_contains = 0.014000 seconds

來源

2016-12-28 22:41:29 Superdooperhero

由於字跡更快，感到驚訝。接下來的問題是生成'l_words'對象需要多長時間。 +1！ – Cometsong 2018-02-05 16:34:24

在一個巨大的列表中查找/搜索的最有效方式（python）

回答

相關問題