Regex to match sentences containing decimals and names

I think I'm close with this, but as soon as I move the punctuation capture to the end of the sentence, it gets stuck.

The sentence scenarios are all in the text below:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a sentence  with odd spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.

This should result in the following captures:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a sentence  with odd spacing. 
This is one with lots of exclamation marks at the end!!!! 
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings? 
Last sentence without a space at the start.

This is the expression I have:

.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))

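It has two problems:

The punctuation ends up at the start of each capture.
It correctly handles one name per sentence, but not two. (Bonus points if it correctly captures "Mr D.J. Smith", but I can't work out how to do that without it failing to match sentences that end in a single letter.)

The data coming into this will have had some normalisation, so we know it will end with a full stop and be on a single line, but any pointers are welcome.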

2014-09-21 Tim

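Natural language parsing with a regex? I can safely say that you will never write a regex that encapsulates all the rules of punctuation. Think again. – spender 2014-09-21 12:02:36

We already have an NLP step before this, so this is an additional pass (for the exceptions). Cheers. – Tim 2014-09-21 12:31:10

I agree with @spender's suggestion to use a parser for this, so that all the punctuation rules are covered.

However, the following works for your scenario.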

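// Needs: using System; using System.Text.RegularExpressions; and s holding the input text.
// Closing punctuation only ends a sentence if it does not directly follow a lone capital letter, a title or a digit.
foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);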

Output

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a sentence  with odd spacing. 
This is one with lots of exclamation marks at the end!!!! 
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings? 
Last sentence without a space at the start.

Ideone Demo
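
For anyone who wants to try this outside the demo, here is a minimal self-contained sketch that wraps the same pattern in a helper. The class name SentenceSplitter, the method name Split and the three sample strings are made up for illustration and are not part of the question. The second and third samples are there to show the trade-off the question already hints at: because the lookbehind rejects punctuation that follows a lone capital letter or a digit, a sentence that genuinely ends in one of those is not split.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Same pattern as in the answer: a lazy run of text, then closing
    // punctuation that is NOT directly preceded by a lone capital letter,
    // a title (Mr, Mrs, Dr, Rev) or a digit. Trailing whitespace is
    // consumed outside the capture group.
    private static readonly Regex SentencePattern = new Regex(
        @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*");

    // Hypothetical helper: returns group 1 of every match, in order.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }

    public static void Main()
    {
        // Illustrative inputs only, not taken from the question.
        string[] samples =
        {
            "Mr. D. Smith paid 10.00 today. He left!",
            "I chose option B. Then I left.",
            "The total was 10. Then I left."
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(sample);
            foreach (string sentence in Split(sample))
                Console.WriteLine("  -> " + sentence);
        }
    }
}
```

The first sample comes back as two sentences. The other two each come back as a single capture, since "B." and "10." look the same to the lookbehind as an initial or a decimal. If that matters for your data, the names and numbers probably need to be resolved before the split rather than inside the regex.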

2014-09-21 12:22:29 hwnd

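Perfect, thanks hwnd. We will first pre-process the input with the natural language parser (or, more precisely, our named entity parser), which will resolve all the "names" in a sentence before the sentence ending is determined, but your solution sorts out the punctuation. Thanks. – Tim 2014-09-21 12:32:49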
