2013-01-17

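Regex matching (greedy/ungreedy?)

I'm having some trouble picking this data apart. Although helper functions and the like are an option, I'd really like to solve this with nothing but a regex (plus post-processing of the match groups after matching).

This is (part of) the data I have: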

Belgium 
Belgium M_Foo 
Belgium A_Bar 
Belgium M_FooBar 
Belgium S_Whooptee Doo 
Belgium Xxx 
Belgium S_Foo Bar 
United Kingdom 
United Kingdom W_Foo-Bar 
United Kingdom M_Yay 
United Kingdom Xxx 
United Kingdom S_Derp 
United Kingdom F_Doh Lorem 
United Kingdom S_Ipsum Dolor 
United States of America L_Foo 
Macedonia F.Y.R. Xxx 
Macedonia F.Y.R. S_Foo Bar 
Cyprus (Greek) M_Foo 
Congo (Democratic Republic of) 
Congo (Democratic Republic of) Q_Yolo 

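Essentially this is an array of "key/value"-like strings. Each one contains a country name (not normalized, so I can't use hard-coded country names or a lookup; it could also be some string other than a country name), optionally followed by the keyword Xxx or by <random_upcase_char>_<random_text>.

I came up with the following regex: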

^(.+?)(?:\s+(Xxx|[A-Z]_.*)?) 

or, with a minor variation in the first match group:

^(.*?)(?:\s+(Xxx|[A-Z]_.*)?) 

This works fine for the first strings, the ones starting with Belgium. For those records it returns the following results:

Group 1  Group 2 
================================ 
Belgium 
Belgium  M_Foo 
Belgium  A_Bar 
Belgium  M_FooBar 
Belgium  S_Whooptee Doo 
Belgium  Xxx 
Belgium  S_Foo Bar 

However, the following lines cause trouble:

Group 1  Group 2 
================================ 
United 
United 
United 
United 
United 
United 
United 
United 
Macedonia 
Macedonia 
Cyprus 
Congo 
Congo 
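The behaviour above can be reproduced with a small sketch. It uses Python's `re` module as a stand-in for the .NET engine, which is an assumption (the constructs in this pattern should behave the same in both, but that has not been checked against .NET here); `split_line` is a hypothetical helper name. Because nothing follows the optional group, the lazy `(.+?)` is allowed to stop at the first whitespace it finds.

```python
import re

# The pattern from the question, unchanged: note there is no end-of-line anchor.
PATTERN = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)")

def split_line(line):
    """Return (group 1, group 2) for a line, or None if the pattern does not match."""
    m = PATTERN.match(line)
    return (m.group(1), m.group(2)) if m else None

for sample in ("Belgium M_Foo", "United Kingdom W_Foo-Bar", "Macedonia F.Y.R. Xxx"):
    # For multi-word names, group 1 ends at the first whitespace and group 2 is empty.
    print(sample, "->", split_line(sample))
```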

What I want the regex to produce is the following:

Group 1       Group 2 
================================================ 
United Kingdom 
United Kingdom     W_Foo-Bar 
United Kingdom     M_Yay 
United Kingdom     Xxx 
United Kingdom     S_Derp 
United Kingdom     F_Doh Lorem 
United Kingdom     S_Ipsum Dolor 
United States of America  L_Foo 
Macedonia F.Y.R.    Xxx 
Macedonia F.Y.R.    S_Foo Bar 
Cyprus (Greek)     M_Foo 
Congo (Democratic Republic of) 
Congo (Democratic Republic of) Q_Yolo 

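But I can't get the first part to match. I'm fairly sure it has something to do with the greedy/ungreedy behaviour of the first match group, but after fiddling with it for a while I can't get it to work...

I don't care if extra/other/more match groups are returned. The regex is intended for use in a .NET C# application (in case you're wondering which "dialect" this is).

Any help would be much appreciated.

Answers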

0

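Sometimes, with non-greedy matching, anchoring is very important. In this case, anchoring to the end of the line solves the problem. Your regex should be: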

^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$ 

Note that I also moved the optional (?) quantifier one grouping level outward, so the whitespace is optional as well.
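As a rough check of the anchored pattern, here is a minimal sketch in Python (again an assumption: Python's `re` stands in for the .NET engine). `split_record` is a hypothetical helper, and stripping the line first is my own addition, since trailing whitespace would otherwise end up inside group 1.

```python
import re

# The anchored pattern from this answer: `$` forces the lazy group to keep
# growing until the optional suffix, or the end of the line, is reached.
ANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$")

def split_record(line):
    """Return (name, suffix); suffix is None when the line holds only a name."""
    m = ANCHORED.match(line.strip())  # strip() is an assumption, see lead-in
    if m is None:
        return None
    return m.group(1), m.group(2)

for sample in ("Belgium", "United Kingdom W_Foo-Bar", "Macedonia F.Y.R. Xxx"):
    print(sample, "->", split_record(sample))
```

With the end anchor in place the lazy quantifier can no longer stop at the first space, so multi-word names stay together in group 1.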

+0

http://regex101.com/r/aB8dH8 doesn't seem to match 'Belgium', 'Congo (Democratic Republic of)' and 'United Kingdom' though. – h2ooooooo

+0

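@h2ooooooo: I guess that has to do with subtle variations in the '.Net' regex syntax? At first glance it seems to **almost** work in RAD Software's Regular Expression Designer... However, 'Belgium Xxx' and the other '... Xxx' lines fail the test... I'm trying to see if I can fix it... – RobIII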

+0

Edited; it didn't account for there being no space after the state name when no more data follows. –

0

I managed to get what you want with this regex (with the multiline option on):

^((?:.+?|)+?)(?:\s+(Xxx|[A-Z]_.*)|\s)?$ 

Using your input gave me this result:

1: Belgium     2: 
1: Belgium     2: M_Foo 
1: Belgium     2: A_Bar 
1: Belgium     2: M_FooBar 
1: Belgium     2: S_Whooptee Doo 
1: Belgium     2: Xxx 
1: Belgium     2: S_Foo Bar 
1: United Kingdom   2: 
1: United Kingdom   2: W_Foo-Bar 
1: United Kingdom   2: M_Yay 
1: United Kingdom   2: Xxx 
1: United Kingdom   2: S_Derp 
1: United Kingdom   2: F_Doh Lorem 
1: United Kingdom   2: S_Ipsum Dolor 
1: United States of America 2: L_Foo 
1: Macedonia F.Y.R.   2: Xxx 
1: Macedonia F.Y.R.   2: S_Foo Bar 
1: Cyprus (Greek)   2: M_Foo 
+0

See http://riii.nl/mymrc; this regex (like some of the others posted) doesn't match ' Xxx' for me? – RobIII

+0

@RobIII Using http://gskinner.com/RegExr/ with your input, this regex seems to match everything =/. http://regex101.com/r/xA3kC9 confirms it works – Destrictor

0

Try this (case-insensitive):

^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$ 

It matches all of your examples. But frankly, you should delimit your data properly.

Demo:

$ perl -ne '/^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$/i and print "MATCH: group 1 is \"$1\", group 2 is \"$2\"\n"' <<EOF
> Belgium 
> Belgium M_Foo 
> Belgium A_Bar 
> Belgium M_FooBar 
> Belgium S_Whooptee Doo 
> Belgium Xxx 
> Belgium S_Foo Bar 
> United Kingdom 
> United Kingdom W_Foo-Bar 
> United Kingdom M_Yay 
> United Kingdom Xxx 
> United Kingdom S_Derp 
> United Kingdom F_Doh Lorem 
> United Kingdom S_Ipsum Dolor 
> United States of America L_Foo 
> Macedonia F.Y.R. Xxx 
> Macedonia F.Y.R. S_Foo Bar 
> Cyprus (Greek) M_Foo 
> Congo (Democratic Republic of) 
> Congo (Democratic Republic of) Q_Yolo 
> EOF 
MATCH: group 1 is "Belgium", group 2 is "" 
MATCH: group 1 is "Belgium", group 2 is "M_Foo" 
MATCH: group 1 is "Belgium", group 2 is "A_Bar" 
MATCH: group 1 is "Belgium", group 2 is "M_FooBar" 
MATCH: group 1 is "Belgium", group 2 is "S_Whooptee Doo" 
MATCH: group 1 is "Belgium", group 2 is "Xxx" 
MATCH: group 1 is "Belgium", group 2 is "S_Foo Bar" 
MATCH: group 1 is "United Kingdom", group 2 is "" 
MATCH: group 1 is "United Kingdom", group 2 is "W_Foo-Bar" 
MATCH: group 1 is "United Kingdom", group 2 is "M_Yay" 
MATCH: group 1 is "United Kingdom", group 2 is "Xxx" 
MATCH: group 1 is "United Kingdom", group 2 is "S_Derp" 
MATCH: group 1 is "United Kingdom", group 2 is "F_Doh Lorem" 
MATCH: group 1 is "United Kingdom", group 2 is "S_Ipsum Dolor" 
MATCH: group 1 is "United States of America", group 2 is "L_Foo" 
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. Xxx" 
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. S_Foo Bar" 
MATCH: group 1 is "Cyprus (Greek)", group 2 is "M_Foo" 
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is "" 
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is "Q_Yolo" 
+0

You're right, I need better data. Unfortunately, the vendor of this data can't and won't fix it... :( – RobIII

+0

Added the Congo examples. Note: you have named captures in .NET, and you should use them as well. – fge

+0

I'll add the named captures once it's working ;) – RobIII
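The named-capture suggestion could be sketched roughly as follows. This is Python, where the syntax is `(?P<name>...)`; .NET writes named groups as `(?<name>...)`. The group names `country` and `suffix` and the helper `parse` are hypothetical, chosen only for illustration, and the line is stripped first as an extra assumption.

```python
import re

# The case-insensitive pattern from this answer, with the two capturing
# groups given (hypothetical) names instead of numbers.
NAMED = re.compile(
    r"^(?P<country>[A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)"
    r"(?:\s+(?P<suffix>Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$",
    re.IGNORECASE,
)

def parse(line):
    """Return a dict with 'country' and 'suffix' keys, or None if there is no match."""
    m = NAMED.match(line.strip())
    return m.groupdict() if m else None

for sample in ("United Kingdom M_Yay", "Cyprus (Greek) M_Foo", "Macedonia F.Y.R. Xxx"):
    print(parse(sample))
```

As in the Perl demo above, the Macedonia lines match but are split after "Macedonia", because the first group only accepts letters.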

0

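/(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)/gm will match all of the strings; however, any line that contains only a country will have it placed in the third match group (so check for that as you go through your results).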

Demo
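The group-3 check described in this answer could look roughly like the sketch below. It is written in Python with `re.MULTILINE` standing in for the `/gm` flags, which is an assumption about equivalent behaviour; `split_all` and `DATA` are hypothetical names, and the sample text is a small subset of the question's data without trailing whitespace.

```python
import re

# The alternation from this answer: the first branch captures name + suffix
# in groups 1 and 2, the second branch captures a bare name in group 3.
ALTERNATION = re.compile(r"(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)", re.MULTILINE)

def split_all(text):
    """Return a list of (name, suffix) pairs, one per matched line."""
    results = []
    for m in ALTERNATION.finditer(text):
        # A line with no suffix lands in group 3, so fall back to it.
        name = m.group(1) if m.group(1) is not None else m.group(3)
        results.append((name, m.group(2)))
    return results

DATA = "Belgium\nBelgium Xxx\nUnited Kingdom S_Ipsum Dolor\nCongo (Democratic Republic of)"
for pair in split_all(DATA):
    print(pair)
```

One thing to keep in mind with this pattern is that `\s+` can also match a line break, so it relies on every line starting with a name rather than with a suffix.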

+0

I don't get it... why doesn't 'Belgium Xxx' (and the other '... Xxx' lines) match for me? (see http://i.imgur.com/mfhpU.png) – RobIII