爲什麼`\ s +`在這個Perl正則表達式中比`\ s \ s *`快得多？

爲什麼替換\s*（或甚至\s\s*）與\s+導致此輸入的這種加速？爲什麼` s +`在這個Perl正則表達式中比` s s *`快得多？

use Benchmark qw(:all); 
$x=(" " x 100000) . "_\n"; 
$count = 100; 
timethese($count, { 
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ }, 
    '/\s+\n/' => sub { $x =~ /\s+\n/ }, 
});

Link to online version

我的代碼中發現一個緩慢的正則表達式s/\s*\n\s*/\n/g - 鑑於末由大量的空間，在這裏和那裏的幾個非空格450KB輸入文件，以及最終的換行符時 - 正則表達式懸而未決。

我直覺地用s/\s+\n/\n/g; s/\n\s+/\n/g;替換了正則表達式，一切都很好。

但爲什麼這麼快？使用re Debug => "EXECUTE"後，我注意到\s+版本以某種方式優化，只在一個迭代中運行：http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "  _%n" 
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "  _%n" (9 bytes) 
    0 <> <  _%n>   | 1:STAR(3) 
            SPACE can match 7 times out of 2147483647... 
            failed... 
    1 < > <  _%n>   | 1:STAR(3) 
            SPACE can match 6 times out of 2147483647... 
            failed... 
    2 < > <  _%n>   | 1:STAR(3) 
            SPACE can match 5 times out of 2147483647... 
            failed... 
    3 < > < _%n>   | 1:STAR(3) 
            SPACE can match 4 times out of 2147483647... 
            failed... 
    4 < > < _%n>   | 1:STAR(3) 
            SPACE can match 3 times out of 2147483647... 
            failed... 
    5 <  > < _%n>   | 1:STAR(3) 
            SPACE can match 2 times out of 2147483647... 
            failed... 
    6 <  > < _%n>   | 1:STAR(3) 
            SPACE can match 1 times out of 2147483647... 
            failed... 
    8 <  _> <%n>   | 1:STAR(3) 
            SPACE can match 1 times out of 2147483647... 
    8 <  _> <%n>   | 3: EXACT <\n>(5) 
    9 <  _%n> <>   | 5: END(0) 
Match successful!

Matching REx "\s+\n" against "  _%n" 
Matching stclass SPACE against "  _" (8 bytes) 
    0 <> <  _%n>   | 1:PLUS(3) 
            SPACE can match 7 times out of 2147483647... 
            failed...

我知道的Perl 5.10+將立即失敗，正則表達式（不運行它），如果換行不存在。我懷疑它正在使用換行符的位置來減少搜索量。對於上面的所有情況，似乎都巧妙地減少了涉及的回溯（通常/\s*\n/針對一串空格需要指數時間）。任何人都可以提供洞察爲什麼\s+版本是如此之快？

另請注意，\s*?不提供任何加速。

來源

2016-07-18 rjh

'\ s'也與'\ n'匹配並沒有幫助。不是換行符的空白字符是'[^ \ S \ n]'，或者可以使用「水平空格」'\ h'。 – Borodin

您可以將比較縮小到'/ \ s * \ n /'和'/ \ s + \ n /'[見live]（http://rextester.com/DSXF83795）。並且請注意，如果字符串不匹配，它只會更快。在比賽的情況下，它似乎需要同時 –

@ThomasAyoub我不認爲這是縮小比較。 '\ s \ s *'應該與'\ s +'相同，而你發佈的兩個是不同的正則表達式。不過，我同意，即使在你發佈的兩個人之間的表現差異也令人驚訝！ – rjh

當一個模式開始時有一個「加」節點（例如\s+）並且該節點未能匹配時，正則表達式引擎向前跳到故障點並再次嘗試;與\s*，另一方面，引擎一次只提前一個字符。

伊夫·奧頓解釋這種優化很好here：（！doevery）

啓動類優化有兩種模式，「想盡一切有效的起始位置」（doevery）和「觸發器模式」，它只是改掉序列中的第一個有效開始位置。現在我們知道，如果我們在匹配「123456」後未能匹配X，那麼在「23456」之後我們也將無法匹配（假設沒有邪惡的技巧無論如何都會禁用優化），所以我們知道我們可以跳過，直到檢查/失敗/然後纔開始尋找真正的匹配。這是觸發器模式。

/\s+/觸發器觸發器模式; /\s*/,/\s\s*/和/\s\s+/不。此優化不能應用於像「\s*」這樣的「星形」節點，因爲它們可以匹配零個字符，因此，序列中某一點的故障不代表以後在同一序列中出現故障。

你可以在調試輸出中看到每個正則表達式。我突出顯示了跳過的字符^。比較這（跳過一次4個字符）：

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/' 
    ... 
    0 <> <123 456 78>   | 1:PLUS(3) 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed... 
    4 <123 > <456 789 x>  | 1:PLUS(3) 
     ^^^^ 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed... 
    8 <23 456 > <789 x>  | 1:PLUS(3) 
     ^^^^ 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed...

這個（跳過一次在一個或兩個字符）：

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/' 
    ... 
    0 <> <123 456 78>   | 1:STAR(3) 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed... 
    1 <1> <23 456 789>  | 1:STAR(3) 
    ^
            POSIXD[\d] can match 2 times out of 2147483647... 
            failed... 
    2 <12> <3 456 789 >  | 1:STAR(3) 
    ^
            POSIXD[\d] can match 1 times out of 2147483647... 
            failed... 
    4 <123 > <456 789 x>  | 1:STAR(3) 
     ^^ 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed... 
    5 <123 4> <56 789 x>  | 1:STAR(3) 
     ^
            POSIXD[\d] can match 2 times out of 2147483647... 
            failed... 
    6 <23 45> <6 789 x>  | 1:STAR(3) 
     ^
            POSIXD[\d] can match 1 times out of 2147483647... 
            failed... 
    8 <23 456 > <789 x>  | 1:STAR(3) 
      ^^ 
            POSIXD[\d] can match 3 times out of 2147483647... 
            failed... 
    9 <23 456 7> <89 x>  | 1:STAR(3) 
      ^
            POSIXD[\d] can match 2 times out of 2147483647... 
            failed... 
    10 <23 456 78> <9 x>  | 1:STAR(3) 
      ^
            POSIXD[\d] can match 1 times out of 2147483647... 
            failed... 
    12 <23 456 789 > <x>  | 1:STAR(3) 
       ^^ 
            POSIXD[\d] can match 0 times out of 2147483647... 
    12 <23 456 789 > <x>  | 3: EXACT <x>(5) 
    13 <23 456 789 x> <>  | 5: END(0)

注意，優化並不適用於/\s\s+/，因爲\s+不在模式的開始。雖然/\s\s+/（邏輯上等於/\s{2,}/）和/\s\s*/（邏輯上等於/\s+/）可能可能被優化，詢問perl5-porters是否值得付出努力可能是有意義的。

_{如果你有興趣，「觸發器模式」是通過時，它的編譯設置PREGf_SKIP旗正則表達式啓用。請參閱5.24.0源文件中regcomp.c中第7344和7405行的代碼和regexec.c中的第1585行。}

來源

2016-07-21 04:54:39 ThisSuitIsBlackNot

謝謝，這正是我正在尋找的答案（實際上深入研究C源代碼並解釋了優化）。非常感謝！！ – rjh

根據http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016，這似乎特別熱門！ – rjh

如果'R +'比RR更好，那麼爲什麼正則表達式編譯器不能在AST級別用'R +'來識別和替換'RR *'？ – Kaz

我不知道正則表達式引擎的內部的，但它看起來像它不承認\s+在某種程度上相同爲\s\s*，因爲在第二個它的空間相匹配，然後嘗試匹配不斷增長的空間數量，而在第一個中，它立即得出結論：沒有匹配。

使用use re qw(Debug);輸出清楚地表明這一點，使用一個更短的字符串：

test_re.pl

#!/usr/bin/env perl 
use re qw(debug); 

$x=(" " x 10) . "_\n"; 
print '-'x50 . "\n"; 
$x =~ /\s+\n/; 
print '-'x50 . "\n"; 
$x =~ /\s\s*\n/; 
print '-'x50 . "\n";

輸出

Compiling REx "\s+\n" 
Final program: 
    1: PLUS (3) 
    2: SPACE (0) 
    3: EXACT <\n> (5) 
    5: END (0) 
floating "%n" at 1..2147483647 (checking floating) stclass SPACE plus minlen 2 
Compiling REx "\s\s*\n" 
Final program: 
    1: SPACE (2) 
    2: STAR (4) 
    3: SPACE (0) 
    4: EXACT <\n> (6) 
    6: END (0) 
floating "%n" at 1..2147483647 (checking floating) stclass SPACE minlen 2 
-------------------------------------------------- 
Guessing start of match in sv for REx "\s+\n" against "   _%n" 
Found floating substr "%n" at offset 11... 
    start_shift: 1 check_at: 11 s: 0 endpos: 11 
Does not contradict STCLASS... 
Guessed: match at offset 0 
Matching REx "\s+\n" against "   _%n" 
Matching stclass SPACE against "   _" (11 bytes) 
    0 <> <   >   | 1:PLUS(3) 
            SPACE can match 10 times out of 2147483647... 
            failed... 
Contradicts stclass... [regexec_flags] 
Match failed 
-------------------------------------------------- 
Guessing start of match in sv for REx "\s\s*\n" against "   _%n" 
Found floating substr "%n" at offset 11... 
    start_shift: 1 check_at: 11 s: 0 endpos: 11 
Does not contradict STCLASS... 
Guessed: match at offset 0 
Matching REx "\s\s*\n" against "   _%n" 
Matching stclass SPACE against "   _" (11 bytes) 
    0 <> <   >   | 1:SPACE(2) 
    1 < > <   _>  | 2:STAR(4) 
            SPACE can match 9 times out of 2147483647... 
            failed... 
    1 < > <   _>  | 1:SPACE(2) 
    2 < > <  _>  | 2:STAR(4) 
            SPACE can match 8 times out of 2147483647... 
            failed... 
    2 < > <  _>  | 1:SPACE(2) 
    3 < > <  _%n>  | 2:STAR(4) 
            SPACE can match 7 times out of 2147483647... 
            failed... 
    3 < > <  _%n>  | 1:SPACE(2) 
    4 < > <  _%n>  | 2:STAR(4) 
            SPACE can match 6 times out of 2147483647... 
            failed... 
    4 < > <  _%n>  | 1:SPACE(2) 
    5 <  > <  _%n>  | 2:STAR(4) 
            SPACE can match 5 times out of 2147483647... 
            failed... 
    5 <  > <  _%n>  | 1:SPACE(2) 
    6 <  > < _%n>  | 2:STAR(4) 
            SPACE can match 4 times out of 2147483647... 
            failed... 
    6 <  > < _%n>  | 1:SPACE(2) 
    7 <  > < _%n>  | 2:STAR(4) 
            SPACE can match 3 times out of 2147483647... 
            failed... 
    7 <  > < _%n>  | 1:SPACE(2) 
    8 <  > < _%n>  | 2:STAR(4) 
            SPACE can match 2 times out of 2147483647... 
            failed... 
    8 <  > < _%n>  | 1:SPACE(2) 
    9 <   > < _%n>  | 2:STAR(4) 
            SPACE can match 1 times out of 2147483647... 
            failed... 
    9 <   > < _%n>  | 1:SPACE(2) 
    10 <   > <_%n>  | 2:STAR(4) 
            SPACE can match 0 times out of 2147483647... 
            failed... 
Contradicts stclass... [regexec_flags] 
Match failed 
-------------------------------------------------- 
Freeing REx: "\s+\n" 
Freeing REx: "\s\s*\n"

來源

2016-07-18 08:57:53 xxfelixxx

我已經附加了一個類似的調試輸出到我的問題，但感謝您的調查。我感興趣的是爲什麼它發生了:) – rjh

首先，即使所得到的正則表達式不會保持相同的含義，讓我們將正則表達式減少到\s*0和\s+0，並使用(" " x 4) . "_0"作爲輸入。對於懷疑者，你可以看到here，滯後仍然存在。

現在讓我們來看看下面的代碼：

$x = (" " x 4) . "_ 0"; 
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

挖一點用use re debugcolor;我們得到以下的輸出：

Guessing start of match in sv for REx "\s*0" against " _0" 
Found floating substr "0" at offset 5... 
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0 
Does not contradict STCLASS... 
Guessed: match at offset 0 
Matching REx "\s*0" against " _0" 
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes) 
    0 < _0>| 1:STAR(3) 
            POSIXD[\s] can match 4 times out of 2147483647... 
            failed... 
    1 < _0>| 1:STAR(3) 
            POSIXD[\s] can match 3 times out of 2147483647... 
            failed... 
    2 < _0>| 1:STAR(3) 
            POSIXD[\s] can match 2 times out of 2147483647... 
            failed... 
    3 < _0>| 1:STAR(3) 
            POSIXD[\s] can match 1 times out of 2147483647... 
            failed... 
    5 < _0>| 1:STAR(3) 
            POSIXD[\s] can match 0 times out of 2147483647... 
    5 < _0>| 3: EXACT <0>(5) 
    6 < _0>| 5: END(0) 
Match successful! 

----------------------- 

Guessing start of match in sv for REx "\s+0" against " _0" 
Found floating substr "0" at offset 5... 
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0 
Does not contradict STCLASS... 
Guessed: match at offset 0 
Matching REx "\s+0" against " _0" 
Matching stclass POSIXD[\s] against " _" (5 bytes) 
    0 < _0>| 1:PLUS(3) 
            POSIXD[\s] can match 4 times out of 2147483647... 
            failed... 
Contradicts stclass... [regexec_flags] 
Match failed

的Perl似乎be optimized for failure。它將首先查找常量字符串（它只消耗O（N））。在這裏，它會尋找0：Found floating substr "0" at offset 5...

然後，它會用正則表達式的可變部分開始，respectivly \s*和\s+，對整個最小字符串檢查：

Matching REx "\s*0" against " _0" 
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes) 
Matching REx "\s+0" against " _0" 
Matching stclass POSIXD[\s] against " _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

後它會尋找第一個位置在位置0

\s*0滿足stclass要求，在這裏：
- 從0開始，找到4個空格然後失敗;
- 從1開始，找到3個空格則失敗;
- 從2開始，找到2個空格則失敗;
- 從3開始，找到1個空格則失敗;
- 從4開始，找到0個空格，然後不會失敗;在0
  - 開始，發現4個空格，然後失敗：
  - 找到一個確切的0
- \s+0。由於空格的最小數量不匹配，正則表達式立即失敗。

如果你想與Perl的正則表達式優化的樂趣，你可以考慮下面的正則表達式/ *\n和/ * \n。乍一看，它們看起來是一樣的，具有相同的含義......但是如果你運行它對(" " x 40000) . "_\n"第一個將檢查所有可能性，而第二個將查找" \n"並立即失敗。

在一個香草，非優化的正則表達式引擎中，這兩個正則表達式都會導致災難性的回溯，因爲它們需要在模式顛簸時重試該模式。然而，在上面的例子中，二不，因爲它已經被優化到find floating substr "0%n"

你可以看到Jeff Atwood's blog另一個例子不能用Perl。

還請注意，這個問題是不是\s考慮，但其中xx*是用來代替x+任何模式看example with 0s也regex explosive quantifiers

在這樣短的例子的行爲是「容易找到」，但如果你開始玩例如：Regular expression hangs program (100% CPU usage)

來源

2016-07-18 09:24:13

在大多數正則表達式引擎'\ s +'中也會以完全相同的方式導致災難性的回溯。所以我的問題是，Perl如何優化'\ s +'情況要快得多？ – rjh

**這兩種**都會導致災難性的回溯（例如使用PCRE引擎，因爲它沒有針對這種情況進行優化：['s \ s * \ n']（https://regex101.com/r/） tU6sX9/1），['\ s + \ n']（https://regex101.com/r/iY9jT3/1））。這裏的區別很可能是由Perl正則表達式的一些優化引起的。 – nhahtdh

對於我而言，這並不是顯而易見的，爲什麼引擎需要回溯。你能澄清嗎？ – jpmc26

\s+\n要求\n之前的字符是SPACE。

根據use re qw(debug)該編譯規定，它需要一個已知數量的空格的直串，直到字符串\n，它首先在輸入中檢查。然後它檢查固定長度的僅空間子字符串與輸入的其餘部分，失敗，因爲它涉及到_。無論輸入有多少空間，都可以單獨檢查。（當有更多_\n每個被發現同樣直接失敗，每次調試輸出。）

看着它這樣，這是你幾乎會期望的優化，利用一個相當具體的搜索模式，並與該輸入勒金出。除了與其他引擎相比，這顯然不會做這種分析。

With \s*\n但事實並非如此。一旦找到\n並且前一個字符不是空格，則搜索未失敗，因爲\s*不允許任何內容（零字符）。沒有固定長度的子串，而且它在回溯遊戲中。

來源

2016-07-18 09:38:42 zdim

'/ \ s \ s *？\ n \'遭受回溯？ – Zaid

@Zaid：關於回溯，貪婪和非貪婪模式*相同*。唯一的區別是，貪婪模式將盡可能以* long *開頭，並縮短到正則表達式的其餘部分匹配，而非貪婪模式將盡可能以* short *開頭，並延長到其餘部分正則表達式匹配。 – Borodin

調試模式的輸出表明引擎實際上從左到右匹配'\ s *'或'\ s +'，而不是在'\ n'之前檢查是否存在'_'。我已經爲源代碼添加了額外的日誌記錄，那裏顯然有一個循環。 – nhahtdh

爲什麼`\ s +`在這個Perl正則表達式中比`\ s \ s *`快得多？

回答

相關問題