Improving the performance of Python regular expressions

I'm trying to improve on the following regular expressions:

urlpath = columns[4].strip()
urlpath = re.sub(r"(\?.*|\/[0-9a-f]{24})", "", urlpath)
urlpath = re.sub(r"\/[0-9\/]*", "/", urlpath)
urlpath = re.sub(r"\;.*", "", urlpath)
urlpath = re.sub(r"\/", ".", urlpath)
urlpath = re.sub(r"\.api", "api", urlpath)
if urlpath in dlatency:

This converts a URL like this:

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

into

api.v4.path.apiCallTwo

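I'd like to improve the performance of the regexes. The script runs every 5 minutes over roughly 50,000 files and takes about 40 seconds to run.

Thanks

Source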

2012-06-05 coderwhiz

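Are you sure the regexes are the bottleneck in the script, and not the hard disk? –

Disk I/O is fairly low. The script reads the log file backwards, line by line, until it reaches a line more than 5 minutes old. – coderwhiz

Is that based on profiling the code, or on a hunch? – hexparrot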

Try this:

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158' 
re.sub(r'\?.+', '', s).replace('/', '.')[1:] 
> 'api.v4.path.apiCallTwo'

For better performance, compile the regular expression once and reuse it, like this:

regexp = re.compile(r'\?.+') 
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158' 

# `s` changes, but you can reuse `regexp` as many times as needed 
regexp.sub('', s).replace('/', '.')[1:]
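
To check whether precompiling actually matters for a given workload, the two variants can be timed side by side. The following is only a rough sketch using the standard timeit module; the function names and the iteration count are illustrative assumptions, and no particular timings are implied. Note that the re module keeps an internal cache of recently used patterns, so the difference is mostly the per-call cache lookup rather than a full recompilation.

```python
import re
import timeit

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
query = re.compile(r'\?.+')  # compiled once, outside the hot loop


def with_module_function(url):
    # re.sub has to look the pattern up in the module's cache on every call.
    return re.sub(r'\?.+', '', url).replace('/', '.')[1:]


def with_compiled_pattern(url):
    # Reuses the pattern object created above.
    return query.sub('', url).replace('/', '.')[1:]


if __name__ == '__main__':
    # Illustrative iteration count; adjust to match the real workload.
    for func in (with_module_function, with_compiled_pattern):
        seconds = timeit.timeit(lambda: func(s), number=100000)
        print(func.__name__, round(seconds, 3))
```

Both functions return the same string, so any difference in the printed numbers comes from how the pattern is obtained.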

A simpler approach, without regular expressions:

s[1:s.index('?')].replace('/', '.') 
> 'api.v4.path.apiCallTwo'
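
One caveat with the slice above: str.index raises ValueError when the URL contains no '?'. A minimal sketch of a variant that tolerates both cases, using str.partition, might look like this (path_to_key is a hypothetical name chosen for illustration):

```python
def path_to_key(url):
    # partition returns the whole string as its first element when '?' is absent,
    # so no exception is raised for URLs without a query string.
    return url.partition('?')[0].strip('/').replace('/', '.')


with_query = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
without_query = '/api/v4/path/apiCallTwo'

print(path_to_key(with_query))     # api.v4.path.apiCallTwo
print(path_to_key(without_query))  # api.v4.path.apiCallTwo

try:
    without_query[1:without_query.index('?')]
except ValueError as exc:
    # str.index cannot find '?' and raises instead of returning a position.
    print('index failed:', exc)
```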

Source

2012-06-05 15:10:21

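There is 'urlparse' ... – schlamar

The second approach fails if there is no '?'. Why reinvent the wheel ;) – schlamar

@ms4py This isn't about parsing a URL, it's about extracting and transforming text from a URL. Mind explaining the unnecessary downvote? –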

You can also compile the re patterns to get a performance boost, for example:

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")

Source

2012-06-05 15:13:18

Are you sure you need regex at all?
For example:

urlpath = columns[4].strip() 
urlpath = urlpath.split("?")[0] 
urlpath = urlpath.replace("/", ".")

Source

2012-06-05 15:18:41 user1417475

A one-liner with urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')
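
The urlparse module used above is the Python 2 name. In Python 3 the same function is available from urllib.parse, so an equivalent sketch (with the sample URL from the question assigned to url for illustration) would be:

```python
from urllib.parse import urlsplit

url = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# urlsplit separates the query string from the path; the rest is plain string work.
urlpath = urlsplit(url).path.strip('/').replace('/', '.')
print(urlpath)  # api.v4.path.apiCallTwo
```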

Source

2012-06-05 15:23:35 badzil

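Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")

As others have mentioned, the speed improvement here is negligible. Other than precompiling the expressions with re.compile(), I would start looking elsewhere.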

import re


# The five original patterns, compiled once up front.
re1 = re.compile(r"(\?.*|\/[0-9a-f]{24})")
re2 = re.compile(r"\/[0-9\/]*")
re3 = re.compile(r"\;.*")
re4 = re.compile(r"\/")
re5 = re.compile(r"\.api")
def orig_regex(urlpath):
    urlpath = re1.sub("", urlpath)
    urlpath = re2.sub("/", urlpath)
    urlpath = re3.sub("", urlpath)
    urlpath = re4.sub(".", urlpath)
    urlpath = re5.sub("api", urlpath)
    return urlpath


# A single pattern that picks out the segments between slashes.
myregex = re.compile(r"([^/]+)")
def my_regex(urlpath):
    return ".".join(x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

# Plain string methods, no regex at all.
def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

# Call func repeatedly so the profiler has something to measure.
def test_func(func, iterations, *args, **kwargs):
    for i in range(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times 
    60003 function calls in 0.108 CPU seconds 

.... 

Iterating my_regex 10000 times 
    130003 function calls in 0.087 CPU seconds 

.... 

Iterating non_regex 10000 times 
    40003 function calls in 0.019 CPU seconds

Results for your 5 regexes without using re.compile:

running <function orig_regex at 0x100532050> 10000 times 
    210817 function calls (210794 primitive calls) in 0.208 CPU seconds

Source

2012-06-05 19:15:25 jlujan

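Going through it line by line:

You're not capturing or grouping, so the ( and ) are unnecessary, and / is not a special character in Python regexes, so it doesn't need to be escaped:

urlpath = re.sub(r"\?.*|/[0-9a-f]{24}", "", urlpath)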

Replacing a / followed by zero repetitions of something with a / is pointless, so require at least one repetition:

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Removing a fixed character and everything after it is faster with a string method:

urlpath = urlpath.partition(";")[0]

Replacing a fixed string with another fixed string is also faster with a string method:

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")
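
Put together, the steps above could be wrapped in one function, with the two remaining patterns compiled once. This is only a sketch: the name normalize and the second sample URL are hypothetical. One thing to watch is that replacing every ".api" also affects a later segment that starts with "api", such as apiCallTwo in the sample URL. The sketch therefore strips only the leading dot, which is an assumption about the intent rather than something stated in the question.

```python
import re

# Compiled once at import time and reused for every URL.
ID_OR_QUERY = re.compile(r"\?.*|/[0-9a-f]{24}")
NUMERIC_SEGMENTS = re.compile(r"/[0-9/]+")


def normalize(urlpath):
    urlpath = ID_OR_QUERY.sub("", urlpath)        # drop query string and 24-hex-digit ids
    urlpath = NUMERIC_SEGMENTS.sub("/", urlpath)  # collapse numeric path segments
    urlpath = urlpath.partition(";")[0]           # drop ';' and everything after it
    urlpath = urlpath.replace("/", ".")
    # Assumption: only the leading dot should be removed, not every ".api".
    return urlpath.lstrip(".")


sample = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
print(normalize(sample))  # api.v4.path.apiCallTwo

# For comparison, replacing every ".api" also merges the last two segments.
print(".api.v4.path.apiCallTwo".replace(".api", "api"))  # api.v4.pathapiCallTwo
```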

Source

2012-06-07 23:22:09 MRAB
