Fastest way to compare text file contents

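I have a question whose answer could help simplify my program. I have a file, text.txt, and I want to go through it and compare it against a list of words, words. Every time one of those words is found, 1 should be added to a list of integers.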

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)

I am just wondering whether there is a faster way to do this. The text file may get larger, and the word list will get larger too.

Source

2015-06-07 user1985351

Are you trying to find the number of matches? – thefourtheye

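You can convert words to a set so that lookups are faster. This should improve the program's performance: finding a value in a list means walking the list one element at a time (O(n) time), whereas a set uses hashing to find elements, so a membership test takes O(1) (constant) time on average.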

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

Then, whenever there is a match, you can count it with the sum function, like this:
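The snippet this sentence introduces seems to be missing from this copy of the answer. The following is a minimal sketch of what it probably looked like, inferred from the equivalent loop and the list-comprehension variant that appear further down. The helper name count_matches and the in-memory sample text are assumptions, added only so the example runs without a real text.txt.

```python
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

def count_matches(lines, words):
    """Count whitespace-separated tokens that are members of `words`."""
    # `part in words` evaluates to True or False; sum() adds them as 1 or 0.
    return sum(part in words for line in lines for part in line.split())

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("the cat and the dog\ncan it help one or two\n")
print(count_matches(sample, words))
```

With a real file you would pass the open file object instead: with open('text.txt') as file: print(count_matches(file, words)).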

Booleans and their integer equivalents

In Python, the result of a Boolean expression is equal to 0 or 1, for False and True respectively.

>>> True == 1 
True 
>>> False == 0 
True 
>>> int(True) 
1 
>>> int(False) 
0 
>>> sum([True, True, True]) 
3 
>>> sum([True, False, True]) 
2

So whenever you check part in words, the result is either False or True, which counts as 0 or 1, and we sum all of these values.

Summing the result of part in words over every token in the file is functionally equivalent to:

result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1
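One behavioural difference is worth checking before switching: the loop in the question tests word in part, which is a substring test, while part in words tests whether the whole token equals a listed word. The sketch below compares the two on a made-up sentence; the function names are hypothetical and exist only for this illustration.

```python
words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
word_set = set(words)

def count_substring_hits(tokens, words):
    """Question's test: one hit per (token, word) pair where word occurs inside token."""
    return sum(word in part for part in tokens for word in words)

def count_exact_hits(tokens, word_set):
    """Set-based test: one hit per token that is exactly equal to a listed word."""
    return sum(part in word_set for part in tokens)

tokens = "there is another one".split()
# 'the' is a substring of 'there' and of 'another', so the two counts differ.
print(count_substring_hits(tokens, words))
print(count_exact_hits(tokens, word_set))
```

If whole-word matching is what you actually want, the set-based version is both faster and closer to the intent.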

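Note: if you really want a list with a 1 for every match (and a 0 for every token that does not match), you can simply turn the generator expression passed to sum into a list comprehension, like this: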

with open('text.txt') as file: 
    print([int(part in words) for line in file for part in line.split()])

Word frequency

If you actually want the frequency of each individual word in words, you can use collections.Counter, like this:

from collections import Counter 
with open('text.txt') as file: 
    c = Counter(part for line in file for part in line.split() if part in words)

This counts how many times each word in words occurs in the file.
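As a rough illustration of how the resulting Counter can be read, here is a small sketch. The helper name word_frequencies and the in-memory sample text are assumptions that stand in for the real file.

```python
import io
from collections import Counter

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

def word_frequencies(lines, words):
    """Count occurrences of each token that is a member of `words`."""
    return Counter(part for line in lines for part in line.split() if part in words)

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("the cat and the dog\nit can help the cat\n")
c = word_frequencies(sample, words)

print(c['the'])           # occurrences of a single word
print(c['two'])           # a word that never appeared: Counter gives 0, no KeyError
print(c.most_common(1))   # the most frequent listed word with its count
print(sum(c.values()))    # total number of matches across all listed words
```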

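As per the comment, you can use a dictionary that stores positive words with positive scores and negative words with negative scores, and add them up like this: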

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1} 
with open('text.txt') as file: 
    print(sum(words.get(part, 0) for line in file for part in line.split()))

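Here we use words.get to fetch the score stored for each word in the dictionary. If the word is not found in the dictionary (it is neither a good word nor a bad word), the default value 0 is returned.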
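To turn that total into a tone label, something like the sketch below might work. Lower-casing each token and stripping surrounding punctuation are assumptions that the answer does not make, and the names tone_score and describe are hypothetical.

```python
import io
import string

scores = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

def tone_score(lines, scores):
    """Sum the score of every token; unknown tokens contribute 0."""
    total = 0
    for line in lines:
        for part in line.split():
            # Assumption: normalise case and strip punctuation such as ',' or '.'
            token = part.strip(string.punctuation).lower()
            total += scores.get(token, 0)
    return total

def describe(total):
    """Map the numeric total to a rough tone label."""
    if total > 0:
        return 'positive'
    if total < 0:
        return 'negative'
    return 'neutral'

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("Great food, good service.\nI hate the noise.\n")
total = tone_score(sample, scores)
print(total, describe(total))
```

Without the normalisation step, tokens such as "Great" or "good," would not match the dictionary keys and would score 0.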

Source

2015-06-07 14:58:26 thefourtheye

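Thank you. I ran timeit on all of the functions listed here, and yours was the fastest. As for why I was appending 1: I am checking whether an article is positive or negative. If there is a positive word it adds 1, and if there is a negative word it adds -1. Then it sums them up and shows whether the article has a positive or negative tone. Thanks again! – user1985351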

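@user1985351 OK, I have added a way to solve the actual problem you are trying to solve. Let me know whether it helps; otherwise I will remove it. Also, please include all of this information in the question itself. That will help future readers. – thefourtheye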

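You can use set.intersection to find the intersection between a set and a list, so a more efficient approach is to put your words in a set and do the following: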

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
ints = []
with open('text.txt') as f:
    for line in f:
        for _ in range(len(words.intersection(line.split()))):
            ints.append(1)

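Note that the preceding solution is based on your code, which appends 1 to a list. If you only want the final count, you can use a generator expression inside sum: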

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
with open('text.txt') as f:
    # Print the total; a bare sum(...) call would discard the result.
    print(sum(len(words.intersection(line.split())) for line in f))
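One caveat with the intersection approach: a set intersection contains each word at most once, so a word repeated within a line is counted once for that line. The sketch below, using made-up lines in place of the file, compares it with counting every matching token.

```python
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# A list of strings stands in for the lines of text.txt.
lines = ["the cat and the dog", "the end"]

# Intersection: counts distinct listed words per line ('the' once in line 1).
distinct_per_line = sum(len(words.intersection(line.split())) for line in lines)

# Per-token membership: counts every occurrence ('the' twice in line 1).
every_occurrence = sum(part in words for line in lines for part in line.split())

print(distinct_per_line)
print(every_occurrence)
```

Which of the two numbers is the right one depends on whether repeated words within a line should count more than once.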

Source

2015-06-07 14:57:45 Kasramvd
