Improving the speed of regular expression operations in Python

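I have a Python script that runs over more than 1M lines of varying length. The script runs very slowly: in the past 12 hours it has processed only a little over 30,000 lines. Splitting the file is not an option, because the file has already been split. My code looks like this: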

import re
import sys

# Patterns for the wiki markup to strip, compiled once outside the loop.
# Note: regex7 and regex11 are compiled but never used below.
regex1 = re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE)
regex2 = re.compile(r"(<ref.*?</ref>)", flags=re.IGNORECASE)
regex3 = re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE)
regex4 = re.compile(r"(==External links==.*?)", flags=re.IGNORECASE)
regex5 = re.compile(r"(<!--.*?-->)", flags=re.IGNORECASE)
regex6 = re.compile(r"(File:[^ ]*?)", flags=re.IGNORECASE)
regex7 = re.compile(r" [0-9]+ ", flags=re.IGNORECASE)
regex8 = re.compile(r"(\[\[File:.*?\]\])", flags=re.IGNORECASE)
regex9 = re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE)
regex10 = re.compile(r"(\[\[Image:.*?\]\])", flags=re.IGNORECASE)
regex11 = re.compile(r"^[^_].*(\))", flags=re.IGNORECASE)

fout = open(sys.argv[2], 'a+')

with open(sys.argv[1]) as f:
    for line in f:
        # Each input line is "<label>\t<text>".
        parts = line.split("\t")
        label = parts[0].replace(" ", "_").lower()
        line = parts[1].lower()
        try:
            line = regex1.sub("", line)
        except:
            pass
        try:
            line = regex2.sub("", line)
        except:
            pass
        try:
            line = regex3.sub("", line)
        except:
            pass
        try:
            line = regex4.sub("", line)
        except:
            pass
        try:
            line = regex5.sub("", line)
        except:
            pass
        try:
            line = regex6.sub("", line)
        except:
            pass
        try:
            line = regex8.sub("", line)
        except:
            pass
        try:
            line = regex9.sub("", line)
        except:
            pass
        try:
            line = regex10.sub("", line)
        except:
            pass

        # Turn [[a b|c]] links into "a_b c".
        try:
            for match in re.finditer(r"(\[\[.*?\]\])", line):
                replacement_list = match.group(0).replace("[", "").replace("]", "").split("|")
                replacement_list = [w.replace(" ", "_") for w in replacement_list]
                replacement_for_links = ' '.join(replacement_list)
                line = line.replace(match.group(0), replacement_for_links)
        except:
            pass
        # Remove URLs.
        try:
            line = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', line, flags=re.MULTILINE)
        except:
            pass
        # Python 2 str.translate signature: delete the listed punctuation characters.
        try:
            line = line.translate(None, '!"#$%&\'*+,./:;<=>?@[\\]^`{|}~')
        except:
            pass
        try:
            line = line.replace(' (', ' ')
            line = ' '.join([word.rstrip(")") if '(' not in word else word for word in line.split(" ")])
            line = re.sub(r' isbn [\w-]+ ', ' ', line)
            line = re.sub(r' [p]+ [\w-]+ ', ' ', line)
            line = re.sub(r' \d+ ', ' ', line)
            line = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", line)
            line = re.sub(r'\s+', ' ', line).strip()
            line = re.sub(r' isbn [\w-]+ ', ' ', line)
        except:
            pass
        out_string = label + "\t" + line
        fout.write(out_string)
        fout.write("\n")

fout.close()

What changes could I make to get a significant improvement over the current version?

Update 1: After profiling as @fearless_fool suggested, I found that regex3, regex9, and the http removal are the least efficient.

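Update 2: Interestingly, using .* adds a lot of steps to a regex match. I tried replacing it with [^X]*, where X is a character that I know never occurs inside the matched span. This made processing about 20 times faster for 1,000 long lines. For example, regex1 is now regex1 = re.compile(r"(\{\{[^\}]*?\}\})", flags=re.IGNORECASE). What I don't know is how to exclude a sequence of two characters in the negated match. For example, changing (\{\{[^\}]*?\}\}) to (\{\{[^\}\}]*?\}\}) is wrong, because everything inside [] is treated as a set of individual characters.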
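One common way to exclude a two-character sequence, which a character class cannot do, is a negative lookahead. The following is a minimal sketch in Python 3 rather than the asker's code: the names TEMPERED, UNROLLED, and strip_templates are hypothetical, and the patterns only cover the {{...}} case from the update.

```python
import re

# Tempered dot: consume any character as long as "}}" does not start here.
TEMPERED = re.compile(r"\{\{(?:(?!\}\}).)*\}\}", flags=re.DOTALL)

# Unrolled variant: runs of non-"}" characters, plus a single "}" only
# when it is not followed by another "}".
UNROLLED = re.compile(r"\{\{[^}]*(?:\}(?!\})[^}]*)*\}\}")


def strip_templates(text, pattern=UNROLLED):
    """Remove every {{...}} span; a lone "}" inside the span is allowed."""
    return pattern.sub("", text)


sample = "keep {{cite |a=}x}} this {{b}} too"
print(repr(strip_templates(sample)))
print(repr(strip_templates(sample, TEMPERED)))
```

Both patterns stop at the first "}}", so a single "}" inside the braces no longer ends the match early. Which of the two is faster depends on the input, so it is worth timing both on real lines.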

2015-12-30 Nick

Why are you using excepts? How do you expect 'line = regex1.sub("", line)' and the like to fail?

I strongly recommend running a profiler (https://docs.python.org/2/library/profile.html) on your code rather than guessing. – danmcardle
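As a rough illustration of that advice, here is a minimal sketch using cProfile and pstats from the standard library. It is written for Python 3, and clean_lines and profile_cleaning are hypothetical stand-ins for the real per-line loop, which is not reproduced here.

```python
import cProfile
import io
import pstats
import re

REF_TAG = re.compile(r"(<ref.*?</ref>)", flags=re.IGNORECASE)


def clean_lines(lines):
    """Hypothetical stand-in for the per-line cleaning loop."""
    return [REF_TAG.sub("", line) for line in lines]


def profile_cleaning(lines):
    """Run clean_lines under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = clean_lines(lines)
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
    return result, report.getvalue()


cleaned, report = profile_cleaning(["a <ref>x</ref> b"] * 1000)
print(report)
```

cProfile reports time per function, so all calls to a compiled pattern's sub method are lumped together. To see which individual regex is slow, one option is to wrap each substitution in its own small function before profiling.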

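You are running about 20 consecutive regex or text passes over every line, so it can only be slow... What do you expect your code to do? Can't you use a higher-level parser (for example an XML parser...)?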
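One way to reduce the number of passes, short of switching to a real parser, is to merge the patterns that all substitute an empty string into a single alternation. The sketch below is written for Python 3 and is an assumption-laden illustration: COMBINED and strip_markup are hypothetical names, and a single pass is not guaranteed to give the same result as the sequential substitutions, because removing one span can expose a new match for a later pattern.

```python
import re

# Assumption: these alternatives mirror regex1, regex2, regex3, regex5,
# regex8 and regex10 from the question, all of which replace with "".
COMBINED = re.compile(
    r"\{\{.*?\}\}"
    r"|<ref.*?</ref>"
    r"|<ref.*?/>"
    r"|<!--.*?-->"
    r"|\[\[(?:File|Image):.*?\]\]",
    flags=re.IGNORECASE,
)


def strip_markup(line):
    """Remove all of the listed markup spans in one pass over the line."""
    return COMBINED.sub("", line)


sample = "a {{t}} b <!-- c --> d [[File:x.jpg]] e"
print(repr(strip_markup(sample)))
```

Whether this is actually faster than the separate passes should be measured on real input, since a long alternation can also be slow when most alternatives fail late.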

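After using the helpful regex tool that @fearless_fool recommended, I improved the speed significantly by replacing .* with a stricter pattern, for example [^\]]*. Making this change throughout the script improved performance considerably.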
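To check a change like this on your own data, a small timing harness helps. This is a minimal Python 3 sketch using timeit. The names LAZY, NEGATED, and time_pattern and the sample text are hypothetical, and the negated class is only equivalent when the span never contains a "]" character.

```python
import re
import timeit

# Original style: lazy dot up to the closing brackets.
LAZY = re.compile(r"\[\[File:.*?\]\]", flags=re.IGNORECASE)
# Stricter style from the answer: anything except "]" up to the closing brackets.
NEGATED = re.compile(r"\[\[File:[^\]]*\]\]", flags=re.IGNORECASE)


def time_pattern(pattern, text, repeats=200):
    """Return the total seconds taken by `repeats` substitutions on text."""
    return timeit.timeit(lambda: pattern.sub("", text), number=repeats)


sample = "intro [[File:photo.jpg|thumb|a caption]] outro " * 50
lazy_seconds = time_pattern(LAZY, sample)
negated_seconds = time_pattern(NEGATED, sample)
print("lazy:    %.4f s" % lazy_seconds)
print("negated: %.4f s" % negated_seconds)
```

The numbers depend heavily on the input, so it is best to run this on a sample of real lines, and to confirm that both patterns produce the same output before swapping one for the other.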

2016-01-04 21:59:20 Nick

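(Promoting a comment to an answer): I suggest you use the elegant and useful Regex 101 Tool to profile your individual regexes and see whether any of them is taking an excessive amount of time.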

While you're there, you can post a complete example on the site so that others can see what you use for typical input. (I see you've already done this. Great!)

2015-12-31 19:44:52
