Finding string matches using Python

I am reading a text file for analysis. Each word is appended to a list and given an ID:

#!/usr/bin/python3
with fi as myfile:
    for line in myfile:
        for item in line.split(' '):
            db[0].append(id_+1)
            db[2].append(item)
            # ...more stuff

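Then, for each word, I search the list for its matches and store the count as sim1. If a match is found, I test whether the next word also matches consecutively and store that count as sim2. The same applies to sim3. My code looks like this: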

for i in range(id_-3):
    sim1 = 0
    sim2 = 0
    sim3 = 0
    for j in range(id_-3):
        if i == j:
            continue
        if db[2][i] == db[2][j]:
            sim1 += 1
            if db[2][i+1] == db[2][j+1]:
                sim2 += 1
                if db[2][i+2] == db[2][j+2]:
                    sim3 += 1
    db[3].append(sim1)
    db[4].append(sim2)
    db[5].append(sim3)

This works, but it is too slow! I am sure Python offers faster ways to search, but I am still a Python newbie!

Source

2013-06-03 Tarek Eldeeb

Sample input/output?

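It looks like you could benefit from changing how you store the data. For example, turn it into a mapping (a dictionary) from each word to a list of its indexes. Then you can check those lists for consecutive values. No searching at all. In other words, you are not looking for faster Python, you are looking for a better algorithm.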
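A minimal sketch of what this comment suggests, assuming the words are already in a flat list. The names `words`, `build_positions` and `count_following_matches` are illustrative and do not come from the question:

```python
from collections import defaultdict

def build_positions(words):
    """Map each word to the list of indexes where it occurs."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)
    return positions

def count_following_matches(words, positions, i):
    """Count other positions j where words[j] == words[i] and the
    word after j also equals the word after i."""
    if i + 1 >= len(words):
        return 0
    count = 0
    # Only positions holding the same word are visited, not the whole list.
    for j in positions[words[i]]:
        if j != i and j + 1 < len(words) and words[j + 1] == words[i + 1]:
            count += 1
    return count

words = "the cat sat on the cat mat".split()
positions = build_positions(words)
print(positions["the"])                              # indexes of "the"
print(count_following_matches(words, positions, 0))  # other places where "the cat" occurs
```

The lookup `positions[words[i]]` replaces the full scan over every `j`, so each word is only compared against its own occurrences.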

Just use a dictionary! It will make your life easier, take less code, and will most likely speed things up.

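The slowness of your algorithm comes mainly from an inner loop that iterates len(db[2]) times, nested inside an outer loop that also iterates len(db[2]) times. This means the inner code executes len(db[2])^2 times. For example, if your file is large and you parse 5000 words, the inner code runs 5000^2 = 25,000,000 times!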
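To make the quadratic growth concrete, here is a small sketch that simply counts how often the body of a doubly nested loop runs. The function name `count_inner_iterations` is made up for illustration:

```python
def count_inner_iterations(n):
    """Count how many times the body of a doubly nested loop over n items runs."""
    runs = 0
    for i in range(n):
        for j in range(n):
            runs += 1
    return runs

# Each tenfold increase in n makes the inner body run a hundred times more often.
for n in (10, 100, 1000):
    print(n, count_inner_iterations(n))
```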

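So the angle of attack is to find a way to eliminate the inner loop or significantly reduce its cost. Below is a sample solution that needs only one pass over the len(db[2]) items, followed by a second, separate loop that iterates over a much smaller set of items. The second loop contains a few inner loops, but they run far fewer times and cost almost nothing.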

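I timed both your algorithm and mine on a text file of about 48 KB. Yours averaged about 14 seconds on my computer and mine averaged 0.6 seconds, so eliminating the inner loop made the algorithm more than 23 times faster. I also made a few other minor optimizations, such as comparing numbers instead of text and creating the storage arrays at full size up front to avoid append(). append() makes the interpreter grow the array dynamically as needed, which is slower.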
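If you want to check the append() claim on your own machine, a rough timing sketch with `timeit` could look like this. The function names and the sizes are arbitrary choices for illustration. Results depend on the machine and the Python version, and the gap may be small because CPython lists over-allocate when they grow:

```python
import timeit

def fill_with_append(n):
    """Build a list of n items by growing it one element at a time."""
    result = []
    for i in range(n):
        result.append(i)
    return result

def fill_preallocated(n):
    """Build the same list by creating it at full size first."""
    result = [0] * n
    for i in range(n):
        result[i] = i
    return result

n = 100_000
for func in (fill_with_append, fill_preallocated):
    seconds = timeit.timeit(lambda: func(n), number=20)
    print(f"{func.__name__}: {seconds:.3f}s for 20 runs")
```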

from collections import defaultdict

# Create zero-filled sim1, sim2, sim3 arrays to avoid append() overhead.
# This assumes db holds exactly three lists so far, so the new ones
# become db[3], db[4] and db[5].
len_ = len(db[2]) - 2
for _ in range(3):
    db.append([0] * len_)

# Create dictionary, containing d['word'] = [count, [indexes]]
# Do just one full iteration, and make good use of it by calculating
# sim1 (as 'count') and storing an array of number indexes for each word,
# allowing for a very efficient loop coming up...
d = defaultdict(lambda: [0, []])
for index, word in enumerate(db[2]):
    if index < len_:
        # Accumulate sim1
        d[word][0] += 1
    # Store all db[2] indexes where this word exists
    d[word][1].append(index)

# Now loop only through words which occur more than once (smaller loop).
# items() is the Python 3 spelling; on Python 2 this was iteritems().
for word, (count, indexes) in d.items():
    if count > 1:
        # Place the sim1 values into the db[3] array
        for i in indexes:
            if i < len_:
                db[3][i] = count - 1
                # Look for sim2 matches by using index numbers
                next_word = db[2][i+1]
                for next_word_index in d[next_word][1]:
                    j = next_word_index - 1
                    # j < len_ keeps db[2][next_word_index + 1] in range
                    if j != i and j < len_ and j in indexes:
                        # Accumulate sim2 value in db[4]
                        db[4][i] += 1
                        # Look for sim3 matches
                        third_word = db[2][i+2]
                        if third_word == db[2][next_word_index + 1]:
                            # Accumulate sim3 value in db[5]
                            db[5][i] += 1
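For anyone who wants to sanity-check a faster version against the original nested loops, here is a self-contained sketch. It is not the code above. It swaps in a different technique, counting 1-, 2- and 3-word sequences with `collections.Counter`, and compares the result to a direct brute-force count. The function names and the choice of `len(words) - 2` as the loop bound are assumptions made for this example:

```python
from collections import Counter

def sims_brute_force(words):
    """Nested-loop version: for each i, count other positions j that match
    on one, two and three consecutive words."""
    n = max(0, len(words) - 2)
    result = []
    for i in range(n):
        sim1 = sim2 = sim3 = 0
        for j in range(n):
            if i == j:
                continue
            if words[i] == words[j]:
                sim1 += 1
                if words[i + 1] == words[j + 1]:
                    sim2 += 1
                    if words[i + 2] == words[j + 2]:
                        sim3 += 1
        result.append((sim1, sim2, sim3))
    return result

def sims_counter(words):
    """Single-pass version: count each 1-, 2- and 3-word sequence once,
    then subtract one so a position does not count itself."""
    n = max(0, len(words) - 2)
    ones = Counter(words[:n])
    twos = Counter(zip(words[:n], words[1:n + 1]))
    threes = Counter(zip(words[:n], words[1:n + 1], words[2:n + 2]))
    return [
        (ones[words[i]] - 1,
         twos[(words[i], words[i + 1])] - 1,
         threes[(words[i], words[i + 1], words[i + 2])] - 1)
        for i in range(n)
    ]

sample = "a b c a b c a b d a b".split()
print(sims_brute_force(sample) == sims_counter(sample))
```

The Counter version does a fixed amount of work per word, so it avoids the quadratic inner loop entirely.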

Source

2013-10-06 22:42:30

-2

Yes, you are performing a string comparison. That is really slow. What you want is to compile your string into a regular expression pattern. :)

Take a look at Python's re library. Python: re

Source

2013-06-03 22:59:00

-1: string comparison is not "really slow" - it is faster than using re for a simple comparison... "re" is not useful advice in this case
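This is easy to measure rather than argue about. Below is a rough sketch that times an exact comparison with `==` against a precompiled pattern used with `fullmatch`. The word being compared and the repetition count are arbitrary, and the numbers will differ between machines:

```python
import re
import timeit

word = "python"
candidate = "python"
# Precompile once so that compilation cost is not part of the measurement.
pattern = re.compile(re.escape(word))

def equal_by_operator():
    """Exact equality test with the == operator."""
    return word == candidate

def equal_by_regex():
    """Exact equality test through a compiled regular expression."""
    return pattern.fullmatch(candidate) is not None

print("==:", timeit.timeit(equal_by_operator, number=200_000))
print("re:", timeit.timeit(equal_by_regex, number=200_000))
```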

Really? I have searched a 3M file with a compiled regular expression and got better times than with string comparison.

What pattern were you using?
