爲什麼正則表達式不關心字符串長度

爲什麼在正則表達式上的輸入長度不會影響性能，以及這可能如何？爲什麼正則表達式不關心字符串長度

生成的字符串是這樣的：128個隨機字符。然後括號內的兩個數字。這一過程重複了很多次。

128 radnom characters....(-2435346|45436) 128 radnom characters....(-32525562|-325346)

正則表達式獲取括號內的所有數字。這裏是模式。

\(([-+]?\d+\|[-+]?\d+)\)

所以比賽會像

-2435346|45436 
-32525562|-325346 
etc...

這裏是做基準的代碼。生成輸入後我開始秒錶，因爲我只想評估匹配時間。

Random rand = new Random(); 
Func<string> getRandString = // generates 128 random characters. 
    () => Enumerable.Range(0, 128).Select(x => (char) rand.Next()).Aggregate("", (a, b) => a + b); 
Func<int> getRandInteger =() => rand.Next(); // generates random number. 
string format = "{0}({1}|{2})"; 

// Generate the big string. 
StringBuilder bigstr = new StringBuilder(); 
for (int i = 0; i < 100; i++) // repeat 100 times. 
{ 
    bigstr.Append(string.Format(format, getRandString(), getRandInteger(), getRandInteger())); 
} 
string input = bigstr.ToString(); 

Stopwatch stopwatch = Stopwatch.StartNew(); 
var matches = Regex.Matches(input, @"\(([-+]?\d+\|[-+]?\d+)\)"); 
stopwatch.Stop(); 
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matches.Count);

這是我的控制檯輸出，如果我重複循環10倍。

Time Elapsed : 00:00:00.0004132 
InputLength : 1500 
Matches Count : 10

如果我重複循環1000次。

Time Elapsed : 00:00:00.0004373 // seriously? 
InputLength : 149937 
Matches Count : 1000

如果我重複循環1000000次。

Time Elapsed : 00:00:00.0004900 // wtf? 
InputLength : 149964452 
Matches Count : 1000000

，如果你不相信

這是某種惰性計算的屏幕截圖？如果是這樣，那麼它如何顯示比賽的數量？我怎麼在調試器下做到這一點，我可以看到比賽。

在我的正則表達式模式中有什麼特別的東西讓它變得快速嗎？但字符串長度如何不影響性能？我不明白。

來源

2015-11-08 M.kazem Akhgary

沒有什麼特別here.Your正則表達式引擎將Travers的字符串，並保存所有與您的正則表達式匹配的狀態，你是在1000時更大的字符串，它是不是一個大問題，現在adays machines.Just基準試更大的字符串。或者你的benchmarikon不公平。 – Kasramvd

如果您想了解一些有關.NET正則表達式引擎使用的字符串搜索算法的細節，您可能會對[本文的答案]（http://stackoverflow.com/a/32618592/3764814）感興趣。 –

正確的基準：https：//andreyakinshin.gitbooks.io/performancebookdotnet/content/science/microbenchmarking.html –

發生這種情況的原因之一是，matches.Count財產被評估爲懶洋洋地。當你停止你的秒錶時，匹配器準備做匹配，但它沒有找到任何東西。只有當您致電matches.Count它纔開始工作，但您到那時已停止計時。

一個簡單的解決方案是matches.Count呼叫轉移到您的時間的代碼的部分。

另一個原因是，你這已經是將解析正則表達式考慮的時間。這是一個相對昂貴的操作，所以在開啓秒錶計時好之前，你應該這樣做：

Regex testRegex = new Regex(@"\(([-+]?\d+\|[-+]?\d+)\)"); 
Stopwatch stopwatch = Stopwatch.StartNew(); 
var matches = testRegex.Matches(input); 
var matchesCount = matches.Count; 
stopwatch.Stop(); 
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matchesCount);

現在的時間爲10場比賽主場迎戰萬個匹配的增長會變得非常可觀（00:00:00.0129844與00:00:00.0843982），雖然不像輸入長度的千倍差異那樣大。

來源

2015-11-08 11:37:42 dasblinkenlight

大聲笑，我認爲它需要的時間實際上是生成字符串。多麼愚蠢的錯誤！我也在調試器下做了這個，但是我在'Regex.Matches'後面加入了斷點。不管怎麼說，還是要謝謝你！ –

@ M.kazemAkhgary以下是[MatchCollection'的一個實現]（http://reflector.webtropy.com/default.aspx/[email protected]/[email protected]/DEVDIV_TFS/Dev10/Releases/RTMRel/ndp/fx/ SRC /正則表達式/系統/文字/ RegularExpressions/RegexMatchCollection @ CS/1305376/RegexMatchCollection @ CS）。您可以從代碼中看到，執行所有工作的'GetMatch'方法被懶惰地調用。 – dasblinkenlight

@dasblinkenlight您提供的鏈接不適用於我（網頁格式錯誤，所有代碼與背景顏色相同）[此處是指向參考源中同一方法的鏈接]（http：// referencesource .microsoft.com /系統/正則表達式/系統/文本/ regularexpressions/RegexMatchCollection.cs.html＃682620f47b442b05） –

爲什麼正則表達式不關心字符串長度

回答

相關問題