2014-09-21 61 views
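Regex to match sentences with decimal points and names

I think I'm close with this, but as soon as I move the punctuation capture to the end of the sentence it falls apart.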

The sentence scenarios are below:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a sentence  with odd spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start. 

This should result in these captures:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a sentence  with odd spacing. 
This is one with lots of exclamation marks at the end!!!! 
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings? 
Last sentence without a space at the start. 

This is the expression I have:

.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+)) 

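There are two problems with it:

  1. The punctuation ends up at the start of each match.
  2. It correctly handles one name per sentence, but not two. (Bonus points: I'd also like it to capture "Mr D. J. Smith" correctly, but I can't work out how to do that without it failing to match sentences that end in a single letter.)

The data going into this will have some normalization applied, so we know it will end with a full stop and be on a single line, but any pointers are welcome.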

A natural language parser in regex? I can safely say that you will never write a regex that encapsulates all the rules of punctuation. Think again. – spender 2014-09-21 12:02:36

We already have an NLP step before this, so this is a separate idea (handling by exception). Cheers. – Tim 2014-09-21 12:31:10

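Answers

I agree with @spender's suggestion to use a parser for this, since a regex will not cover all the rules of punctuation.

However, the following works for your scenario.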

foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*")) 
     Console.WriteLine(m.Groups[1].Value); 

Output

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a sentence  with odd spacing. 
This is one with lots of exclamation marks at the end!!!! 
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings? 
Last sentence without a space at the start. 

Ideone Demo
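
For anyone who wants to run this locally, here is a minimal sketch that wraps the snippet above in a complete program. It assumes the .NET regex engine, which the pattern depends on because the lookbehind has alternatives of different lengths. The names `SentenceSplitter` and `Split` are made up for illustration and are not part of the answer. The last call shows a limitation of this approach rather than a fix for it: because the lookbehind rejects a single capital letter or a digit before the punctuation, a sentence that ends in one is not split.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Pattern from the answer, unchanged:
    //   .*?                      lazily consume the sentence body
    //   (?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))
    //                            refuse to end right after an initial, a title, or a digit
    //   [!?.;]+                  one or more sentence-ending punctuation marks
    //   \s*                      trailing whitespace, kept outside group 1
    private static readonly Regex SentencePattern = new Regex(
        @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*");

    // Hypothetical helper: returns group 1 of every match, in order.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }

    public static void Main()
    {
        string s = "This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. "
                 + "This is a sentence  with odd spacing. "
                 + "This is one with lots of exclamation marks at the end!!!!"
                 + "This is another with a decimal 10.00 in the middle. "
                 + "Why is it so hard to find sentence endings?"
                 + "Last sentence without a space at the start. ";

        foreach (string sentence in Split(s))
            Console.WriteLine(sentence);

        // Limitation: "B." looks like an initial, so this comes back as ONE sentence.
        Console.WriteLine(Split("We chose plan B. Then we left.").Count);
    }
}
```

If sentences ending in a single capital letter or a number matter for your data, the lookbehind alone cannot tell them apart from initials and decimals, which is where the parser suggested in the comments comes in.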

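Perfect, thanks hwnd. We will pre-process the input with a natural language parser first (or, more to the point, our named entity parser) and will resolve all the "names" in a sentence before splitting on sentence endings, but your solution sorts out the punctuation. Thanks. – Tim 2014-09-21 12:32:49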
