re.match與re.search性能差異

我試圖使用timeit模塊比較re.match和re.search，我發現匹配比搜索更好，當我想要找到的字符串位於字符串的開頭。re.match與re.search性能差異

>>> s1 = ''' 
... import re 
... re.search(r'hello','helloab'*100000) 
... ''' 
>>> timeit.timeit(stmt=s1,number=10000) 
32.12064480781555 


>>> s = ''' 
... import re 
... re.match(r'hello','helloab'*100000) 
... ''' 
>>> timeit.timeit(stmt=s,number=10000) 
30.9136700630188

現在，我知道比賽會在該字符串的開頭模式，如果找到返回一個對象，但我想知道是如何搜索工作。

在開始找到字符串後，搜索是否會執行任何額外的匹配，從而減慢搜索速度？

更新

使用@大衛羅賓遜代碼後我得到的結果simlar他。

>>> print timeit.timeit(stmt="r.match('hello')", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
49.9567620754 
>>> print timeit.timeit(stmt="r.search('hello')", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
35.6694438457

因此，更新後的問題是，爲什麼現在search在業績match？

來源

2012-10-09 RanRag

你真的應該在'setup'參數中爲'timeit'構建測試字符串和正則表達式對象，這樣你只能測量匹配/搜索本身。並且做10次以上的迭代。 – interjay

@interjay：增加到10000。試過'號碼= 100000'，但在我的上網本上花費了太多時間。 – RanRag

...但你沒有做我說的其他事情，所以你沒有測量任何重要的問題（大部分時間都花在構建字符串上）。 – interjay

在我的機器上（Python 2.7.3在Mac OS 10.7.3,1.7 GHz Intel Core i5上），在完成字符串構建，導入re和正則表達式編譯後，執行10000000次迭代而不是10次，我覺得正好相反：

import timeit 

print timeit.timeit(stmt="r.match(s)", 
      setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
      number = 10000000) 
# 6.43165612221 
print timeit.timeit(stmt="r.search(s)", 
      setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
      number = 10000000) 
# 3.85176897049

來源

2012-10-09 15:55:35

現在，這很有趣。 – RanRag

我確認了這個結果。（OS X 10.5.8，python 2.7.3）我發現「搜索」速度快了大約25％。 – mgilson

正如編輯中指出的那樣，如果在設置語句中完成正則表達式編譯，我發現差距會增大。 –

「因此，更新後的問題是，爲什麼現在搜索出表現是否匹配？」

在一個字符串使用，而不是一個正則表達式這種特殊的情況下，確實re.search比re.match默認CPython的實現（我還沒有在Python的其他化身測試這一點）稍快。

>>> print timeit.timeit(stmt="r.match(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
3.29107403755 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
2.39184308052

展望C code behind those modules，它出現在搜索代碼有一個內置的優化to quickly match patterns prefixed with a string lateral。在上面的示例中，整個模式是一個沒有正則表達式模式的文字字符串，所以這個優化的例程用於匹配整個模式。

通知一旦我們介紹的正則表達式的符號和，作爲文字串前綴變短的性能如何降低：

>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hell.')", 
...    number = 10000000) 

3.20765399933 
>>> 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')", 
...    number = 10000000) 
3.31512498856 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')", 
...    number = 10000000) 
3.31983995438 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')", 
...    number = 10000000) 
3.39261603355

對於包含正則表達式模式的圖案的部分，SRE_MATCH被用於確定匹配。這與re.match背後的代碼基本相同。

如果模式以正則表達式模式而非文字字符串開頭，請注意結果如何接近（re.match稍快）。

>>> print timeit.timeit(stmt="r.match(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('.ello')", 
...    number = 10000000) 
3.22782492638 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('.ello')", 
...    number = 10000000) 
3.31773591042

換句話說，忽略這一事實search和match具有不同的目的，re.search比re.match更快僅當模式是文字串。

當然，如果您使用的是文字字符串，您可能更願意使用字符串操作。

>>> # Detecting exact matches 
>>> print timeit.timeit(stmt="s == r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
0.339027881622 

>>> # Determine if string contains another string 
>>> print timeit.timeit(stmt="s in r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
0.479326963425 


>>> # detecting prefix 
>>> print timeit.timeit(stmt="s.startswith(r)", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
1.49393510818 
>>> print timeit.timeit(stmt="s[:len(r)] == r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
1.21005606651

來源

2012-10-10 09:09:01

神奇的研究和實驗。特別感謝您提供源代碼的鏈接。 – pswaminathan

re.match與re.search性能差異

回答

相關問題