Regex parsing email - high CPU load

Possible duplicate:
c# regex email validation

I currently use the regular expression and code below to extract email addresses from an HTML document:

string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"; 
Regex regex = new Regex(
     pattern, 
     RegexOptions.None | RegexOptions.Compiled); 

MatchCollection matches = regex.Matches(input); // Here is where it takes time 
MessageBox.Show(matches.Count.ToString()); 

foreach (Match match in matches) 
{ 
    ... 
}

For example, try parsing http://www.amelia.se/Pages/Amelia-search-result-page/?q=

On RegexHero it crashes.

Is there any way to optimize this?

2012-10-10 Elvin

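1. Don't parse HTML with a regex; use a proper parser. 2. If you want to check whether a string is an email address, use a library rather than a regex (for the complexity of doing it with a regex, see http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html) – Anders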
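As a rough illustration of the "use a library" point for validation, here is a minimal sketch that leans on the framework's `System.Net.Mail.MailAddress` parser instead of a hand-written regex. The class and method names (`EmailCheck`, `LooksLikeEmail`) are made up for this example, and it only covers validating a single candidate string, not extracting addresses from HTML.

```csharp
using System;
using System.Net.Mail;

public static class EmailCheck
{
    // Hypothetical helper: returns true when MailAddress accepts the string
    // as a bare address.
    public static bool LooksLikeEmail(string candidate)
    {
        if (string.IsNullOrWhiteSpace(candidate))
            return false;

        try
        {
            var address = new MailAddress(candidate);
            // MailAddress also accepts display-name forms such as
            // "Bob <bob@example.com>", so require that the parsed address
            // is the whole input.
            return address.Address == candidate;
        }
        catch (FormatException)
        {
            // Thrown when the string is not in a recognized address form.
            return false;
        }
    }
}
```

Something like `EmailCheck.LooksLikeEmail("user@example.com")` would then replace the regex check. What counts as valid is decided by the framework's parser, which may be stricter or looser than the pattern in the question.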

I can think of only one reason to extract email addresses from arbitrary HTML documents, and it is one I certainly won't support. – Philipp

Please read: http://www.regular-expressions.info/catastrophic.html. That is why your regex is slow and the CPU load is high. –
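If the pattern has to stay as it is, one defensive option (assuming .NET Framework 4.5 or later) is to give the `Regex` a match timeout, so a pathological input raises an exception instead of pinning the CPU. The sketch below is only that: the class name `GuardedEmailSearch`, the method name, and the 2-second limit are arbitrary choices for illustration, and a timeout limits the damage rather than making the pattern faster.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GuardedEmailSearch
{
    // Same pattern as in the question; the third argument is the match timeout.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(2)); // arbitrary limit chosen for this sketch

    public static List<string> FindEmails(string input)
    {
        var found = new List<string>();
        try
        {
            // Matches() is evaluated lazily, so a timeout surfaces while
            // enumerating the collection, inside this try block.
            foreach (Match match in EmailRegex.Matches(input))
                found.Add(match.Value);
        }
        catch (RegexMatchTimeoutException)
        {
            // Give up on the rest of the input and keep what was found so far.
        }
        return found;
    }
}
```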

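To elaborate on @Joey's suggestion, I would go through the input line by line, discard every line that does not contain an @, and apply your regex only to the lines that remain. This should reduce the load considerably.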

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

private List<Match> FindEmailMatches()
{
    List<Match> result = new List<Match>();

    using (FileStream stream = new FileStream(@"C:\tmp\test.txt", FileMode.Open, FileAccess.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
        Regex regex = new Regex(pattern, RegexOptions.Compiled);

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Cheap pre-filter: only a line containing '@' can hold an address.
            // IndexOf(char) is available on all .NET versions.
            if (line.IndexOf('@') >= 0)
            {
                MatchCollection matches = regex.Matches(line); // Here is where it takes time
                foreach (Match m in matches) result.Add(m);
            }
        }
    }

    return result;
}
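The question works on an in-memory string (`input`) rather than a file, so the same line filter can be sketched with a `StringReader`. This is an adaptation, not the answerer's code: the class name `LineFilteredSearch` and the choice to return the matched strings instead of `Match` objects are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineFilteredSearch
{
    // Same pattern as in the question, compiled once and reused.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled);

    public static List<string> FindEmails(string input)
    {
        var result = new List<string>();
        using (var reader = new StringReader(input))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Skip lines that cannot contain an address at all.
                if (line.IndexOf('@') < 0)
                    continue;

                foreach (Match m in EmailRegex.Matches(line))
                    result.Add(m.Value);
            }
        }
        return result;
    }
}
```

Note that the filter only helps when the document actually has line breaks. A minified page delivered as one long line that contains an @ is still handed to the regex in full.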

2012-10-10 09:02:05 zeFrenchy

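How would I do that? :) – Elvin

Works great! Thanks a bunch – Elvin

No need to elaborate on my answer; it was wrong for this situation. I was coming from a validation angle, not extraction. – Joey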
