Searching for substrings in a large list

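I am trying to find all the insertions that turn one string into another. I have a list of 14 million strings, and for each string I need to check which insertions can convert it into another string in the list (essentially, I am counting insertion frequencies). Say x is one string and y is another, where x is a substring of y; I need to find the insertions that convert x into y.

I am using the following code snippet. It works, but it takes far too long. I even tried distributing the load across 64 processors, but it would still take 20 days to finish.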

import hashlib

transform = {}  # inserted string -> number of times it was seen

# Words (the list of strings) and ix (this process's id, 0..63) are defined elsewhere.
for i in Words:
    # Trying to distribute the load across different processes, so this part can be ignored.
    h = hashlib.sha256(i.encode("utf-8"))
    n = int(h.hexdigest(), base=16)
    if n % 64 != ix:  # ix is a process-based id
        continue

    for j in Words:
        if len(i) > len(j):
            continue
        if i != j and i in j:  # i is a proper substring of j
            ind = j.find(i)
            s1 = j[:ind]            # text inserted before i
            s2 = j[ind + len(i):]   # text inserted after i

            if s1:
                transform[s1] = transform.get(s1, 0) + 1
            if s2:
                transform[s2] = transform.get(s2, 0) + 1
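
To make the goal concrete, here is what the inner part of that loop extracts for a single pair: the text before and after the first occurrence of the shorter string. This is only a minimal Python 3 sketch; the helper name `surrounding_insertions` and the sample strings are made up for illustration and are not part of the original post.

```python
def surrounding_insertions(x, y):
    """Return the (prefix, suffix) that must be added around x to obtain y.

    Hypothetical helper for illustration. It uses the first occurrence of
    x in y, as the loop above does with find(), and returns None when x is
    not a proper substring of y.
    """
    if x == y or x not in y:
        return None
    ind = y.find(x)
    return y[:ind], y[ind + len(x):]


print(surrounding_insertions("cat", "scatter"))  # ('s', 'ter')
print(surrounding_insertions("cat", "cats"))     # ('', 's')
print(surrounding_insertions("cat", "dog"))      # None
```

An empty prefix or suffix means nothing was inserted on that side, which is why the loop only counts non-empty pieces.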


2012-12-20 Slayer

Try it with PyPy. For most computational tasks it is much faster than regular Python. – Blender

Assume the indentation is correct. It got messed up during copying. – Slayer

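Instead of comparing every word with every other word (quadratic runtime), take every proper substring of each word (linear runtime overall, assuming word length is bounded) and check whether it is in the set of words (looking up an element in a set takes constant time).

This ran in less than 2 seconds on my laptop (for 46265 words of length < 10, giving 47015 unique transformations and 799089 in total):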

from collections import Counter 

# for testing 
from random import choice, randrange 
from string import ascii_uppercase 
big_word = "".join(choice(ascii_uppercase) for i in range(10000)) 
words = [big_word[randrange(len(big_word)):][:randrange(1, 10)] for i in range(100000)] # words of up to 9 letters; all are substrings of big_word 

# now the real code 
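def insertions(words):
    for word in words:
        # ins is a prefix of word: yield it once for every proper prefix
        # of the remainder that is itself a known word
        for i in range(1, len(word) - 1):
            ins = word[:i]
            rest = word[i:]
            for j in range(1, len(rest)):
                if rest[:j] in words:
                    yield ins
        # ins is a suffix of word: yield it once for every suffix of the
        # leading part (at least 2 characters long) that is a known word
        for i in range(1, len(word) - 1):
            rest = word[:i]
            ins = word[i:]
            for j in range(len(rest) - 1):
                if rest[j:] in words:
                    yield ins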

transforms = Counter(insertions(set(words)))
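
For comparison, the same set-lookup idea can be written so that it follows the question's loop step by step: for each word, find the proper substrings that are themselves words, locate the first occurrence with `find`, and count the non-empty text on either side. This is a sketch under assumptions, not the code that was timed above: the name `count_insertions` and the sample list are made up, the input is deduplicated first, and because its bounds follow the question rather than the generator above, its counts need not match the figures quoted earlier.

```python
from collections import Counter


def count_insertions(words):
    """Count prefix/suffix insertions using set lookups (illustrative sketch).

    For every word y and every proper substring x of y that is also a word,
    take the first occurrence of x in y and count the non-empty text before
    and after it. Duplicates in the input are ignored (an assumption).
    """
    vocab = set(words)
    transform = Counter()
    for y in vocab:
        n = len(y)
        # Distinct proper substrings of y that are known words, so each
        # (x, y) pair is counted once, as in the pairwise loop.
        subs = {
            y[a:b]
            for a in range(n)
            for b in range(a + 1, n + 1)
            if b - a < n and y[a:b] in vocab
        }
        for x in subs:
            ind = y.find(x)
            prefix, suffix = y[:ind], y[ind + len(x):]
            if prefix:
                transform[prefix] += 1
            if suffix:
                transform[suffix] += 1
    return transform


counts = count_insertions(["cat", "cats", "scatter", "at"])
print(counts.most_common(1))  # [('s', 3)]
```

The work per word depends only on its length, so with bounded word lengths the total cost grows linearly with the number of words instead of quadratically.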


2012-12-20 23:07:17 WolframH

Thanks... works great... – Slayer
