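Large regex match causes program to hang
I tried asking this question a few days ago, but I didn't phrase it well or post any code at first, and it ended up being closed. So I'm trying again here, because honestly this is quickly driving me crazy. :)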
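I'm trying to implement this Address Parser, which was originally a console-based C# program. I have successfully converted it into a standalone WPF program that contains only a TextBox for input, a Button to trigger the parsing, and a TextBlock to display the results. As of this writing I've trimmed the output down to just what I need in my main program, and it still works fine. I've included the complete code below.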
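My next step was to port this into my main program, which I did by literally copying and pasting. However, when I run it, the program hangs after the button is pressed. Eventually VS gives an error saying that the process has gone too long without pumping messages, and the memory usage shown in Task Manager gradually climbs from ~70k to 3,000,000. In response, I moved the parsing method to a BackgroundWorker, hoping to take the load off the main thread. This did fix the program freeze, but the background thread simply did the same thing: memory usage kept rising and nothing was ever returned.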
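So now I'm at something of an impasse. I know the problem is somewhere in the var result = parser.ParseAddress(input); statement, because when I set a breakpoint on every line of code, that is the last one hit. But I have no idea why this would cause a problem in one WPF program and not in the other.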
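If needed, I'd be more than happy to post the complete source code for the main program, but I can't imagine that posting about 20 different class files and the project code here is a good idea. :)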
Standalone WPF application
namespace AddressParseWPF
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }
        public void Execute()
        {
            AddressParser.AddressParser parser = new AddressParser.AddressParser();
            var input = inputTextBox.Text;
            var result = parser.ParseAddress(input);
            if (result == null)
            {
                outputTextBlock.Text = "ERROR. Input could not be parsed.";
            }
            else
            {
                outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + " " + result.Zip);
            }
        }
        private void actionButton_Click(object sender, RoutedEventArgs e)
        {
            Execute();
        }
    }
}
Main program with the parser grafted in
public void ExecuteAddressParse()
{
    AddressParser.AddressParser parser = new AddressParser.AddressParser();
    var input = inputTextBox.Text;
    var result = parser.ParseAddress(input);
    if (result == null)
    {
        outputTextBlock.Text = "ERROR. Input could not be parsed.";
    }
    else
    {
        outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + " " + result.Zip);
    }
}
private void actionButton_Click(object sender, RoutedEventArgs e)
{
    ExecuteAddressParse();
}
ParseAddress method
public AddressParseResult ParseAddress(string input)
{
    if (!string.IsNullOrWhiteSpace(input))
    {
        var match = addressRegex.Match(input.ToUpperInvariant());
        if (match.Success)
        {
            var extracted = GetApplicableFields(match);
            return new AddressParseResult(Normalize(extracted));
        }
    }
    return null;
}
... when the RegexOptions.Compiled flag is omitted
Regex match method
private static void InitializeRegex()
{
    var suffixPattern = new Regex(
        string.Join(
            "|",
            new [] {
                string.Join("|", suffixes.Keys),
                string.Join("|", suffixes.Values.Distinct())
            }),
        RegexOptions.Compiled);
    var statePattern =
        @"\b(?:" +
        string.Join(
            "|",
            new [] {
                string.Join("|", states.Keys.Select(x => Regex.Escape(x))),
                string.Join("|", states.Values)
            }) +
        @")\b";
    var directionalPattern =
        string.Join(
            "|",
            new [] {
                string.Join("|", directionals.Keys),
                string.Join("|", directionals.Values),
                string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"(\w)", @"$1\.")))
            });
    var zipPattern = @"\d{5}(?:-?\d{4})?";
    var numberPattern =
        @"(
            ((?<NUMBER>\d+)(?<SECONDARYNUMBER>(-[0-9])|(\-?[A-Z]))(?=\b)) # Unit-attached
            |(?<NUMBER>\d+[\-\ ]?\d+\/\d+) # Fractional
            |(?<NUMBER>\d+-?\d*) # Normal Number
            |(?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+) # Wisconsin/Illinois
        )";
    var streetPattern =
        string.Format(
            CultureInfo.InvariantCulture,
            @"
            (?:
                # special case for addresses like 100 South Street
                (?:(?<STREET>{0})\W+
                    (?<SUFFIX>{1})\b)
                |
                (?:(?<PREDIRECTIONAL>{0})\W+)?
                (?:
                    (?<STREET>[^,]*\d)
                    (?:[^\w,]*(?<POSTDIRECTIONAL>{0})\b)
                    |
                    (?<STREET>[^,]+)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                    |
                    (?<STREET>[^,]+?)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)?
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                )
            )
            ",
            directionalPattern,
            suffixPattern);
    var rangedSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join("|", rangedSecondaryUnits.Keys) +
        @")(?![a-z])";
    var rangelessSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join(
            "|",
            string.Join("|", rangelessSecondaryUnits.Keys)) +
        @")\b";
    var allSecondaryUnitPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
        (
            (?:
                (?: (?:{0} \W*)
                    | (?<SECONDARYUNIT>\#)\W*
                )
                (?<SECONDARYNUMBER>[\w-]+)
            )
            |{1}
        ),?
        ",
        rangedSecondaryUnitPattern,
        rangelessSecondaryUnitPattern);
    var cityAndStatePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
        (?:
            (?<CITY>[^\d,]+?)\W+
            (?<STATE>{0})
        )
        ",
        statePattern);
    var placePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
        (?:{0}\W*)?
        (?:(?<ZIP>{1}))?
        ",
        cityAndStatePattern,
        zipPattern);
    var addressPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
        ^
        # Special case for APO/FPO/DPO addresses
        (
            [^\w\#]*
            (?<STREETLINE>.+?)
            (?<CITY>[AFD]PO)\W+
            (?<STATE>A[AEP])\W+
            (?<ZIP>{4})
            \W*
        )
        |
        # Special case for PO boxes
        (
            \W*
            (?<STREETLINE>(P[\.\ ]?O[\.\ ]?\ )?BOX\ [0-9]+)\W+
            {3}
            \W*
        )
        |
        (
            [^\w\#]* # skip non-word chars except # (eg unit)
            ( {0})\W*
            {1}\W+
            (?:{2}\W+)?
            {3}
            \W* # require on non-word chars at end
        )
        $ # right up to end of string
        ",
        numberPattern,
        streetPattern,
        allSecondaryUnitPattern,
        placePattern,
        zipPattern);
    addressRegex = new Regex(
        addressPattern,
        RegexOptions.Compiled |
        RegexOptions.Singleline |
        RegexOptions.IgnorePatternWhitespace);
}
What is your input? – BlueM
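Regular expressions with backtracking can be very slow, especially on large inputs. Try limiting the input to your expression to three lines at a time: in the ParseAddress method, split the text into individual lines and feed them to the regex in a loop, lines {0,1,2}, then {1,2,3}, then {2,3,4}, and so on. Return when a match is found. – dasblinkenlight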
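The windowing idea in the comment above could be sketched roughly as follows. This is only an illustration, not the AddressParser library's code: WindowedMatcher, MatchInWindows, and the windowSize parameter are hypothetical names, and the sketch assumes the caller passes in an already-constructed regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WindowedMatcher
{
    // Hypothetical helper: runs the regex against each run of up to
    // windowSize consecutive lines and returns the first successful
    // match, or null if no window matches.
    public static Match MatchInWindows(Regex regex, string input, int windowSize = 3)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return null;
        }

        // Split on both Windows and Unix line endings, dropping empty lines.
        var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        // If there are fewer lines than windowSize, a single shorter window is tried.
        int lastStart = Math.Max(0, lines.Length - windowSize);
        for (int start = 0; start <= lastStart; start++)
        {
            int count = Math.Min(windowSize, lines.Length - start);
            string window = string.Join("\n", lines, start, count);

            // Upper-case the window, as ParseAddress in the question does.
            Match match = regex.Match(window.ToUpperInvariant());
            if (match.Success)
            {
                return match;
            }
        }
        return null;
    }
}
```

Inside ParseAddress, the call addressRegex.Match(input.ToUpperInvariant()) could then be swapped for WindowedMatcher.MatchInWindows(addressRegex, input). This bounds the amount of text each match attempt sees, which is the point of the suggestion, though it does not by itself show whether backtracking is what causes the hang.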
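When you moved the code, did you by any chance modify the regex? Or alter the input while copying and pasting? Did you initialize the parser? Or did you accidentally double-click the button and trigger some odd bug in the library? High memory consumption points to a bad regex/input combination, poor runtime behavior aside. – skarmats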
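One way to test whether a particular regex/input combination is the culprit is to give the regex a match timeout, so that a runaway match throws instead of hanging. A minimal sketch, assuming .NET Framework 4.5 or later (where the Regex constructor accepts a matchTimeout argument); TimedMatcher and TryMatch are hypothetical names and not part of the AddressParser library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimedMatcher
{
    // Hypothetical helper: returns true if the pattern matches the input.
    // If matching takes longer than the timeout, timedOut is set to true
    // and false is returned instead of letting the match run on.
    public static bool TryMatch(string pattern, string input, TimeSpan timeout, out bool timedOut)
    {
        timedOut = false;
        var regex = new Regex(pattern, RegexOptions.None, timeout);
        try
        {
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up: this pattern/input pair is suspect.
            timedOut = true;
            return false;
        }
    }
}
```

The timeout only covers match operations, not construction of the Regex object, so if the time is being spent while the regex itself is built, this check would not catch it.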