.NET regex for HTML headings

I'm trying to extract all the data inside the heading tags of an HTML file that was converted from a Word document (by Word itself).

I have the following regex:

<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?

and my source text looks like this:

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
</span>The Scheme (planning scheme) has been 
prepared in accordance with the <i>asdf </i>(the Act) 
as a framework for managing development in a way that advances the purpose of 
the Act.</h5> 

<h5>(2)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
</span>In seeking to achieve this purpose, the planning scheme sets out 
the future development in the 
planning scheme area over the next 20 years.</h5> 

<h5>(3)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
</span>While the planning scheme has been prepared with a 20 year horizon, it 
will be reviewed periodically in accordance with the Act to ensure that it 
responds appropriately to the changes of the community at Local, Regional and State 
levels.</h5>

The regex seems to work, but it captures everything from the first h5 through to the last one (or to any other h6, h7 or h8).

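I'm not trying to do anything complicated with the data here; I only need a simple extraction, so I'd like to stick with regex rather than use an HTML parser. It's fair to say my example headings are well formed, i.e. an hX is always closed by an hX rather than an hY, and there are no headings nested inside headings or anything like that.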

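I thought that adding ? to the end of the (?:...) group would make it non-greedy, so that it would match only the first instance rather than as much as it can. Am I missing something here about how greediness works?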

Edit:

The regex

<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)

also seems to match

<h6>&nbsp;</h6> 

<h6>&nbsp;</h6> 

<h6>&nbsp;</h6> 

<h6>&nbsp;</h6> 

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
</span>Short Title -The planning scheme policy may be cited as PSP No 2. – 
Engineering Standards – Road and Drainage Infrastructure.</h5>

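so it includes the whole text, whereas I want it to ignore the h6s that contain only &nbsp;, because they don't have this span inside them.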

Source

2012-02-08 Daniel Powell

There is a greedy .+ in the middle of your regex (just before </span>) that is causing the problem. Change it to .+? and your regex should work.
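The difference between the greedy and lazy forms can be checked with a short script. This is only a sketch: it uses Python rather than .NET, so the named groups are written (?P<Name>...) instead of (?<Name>...), and re.DOTALL stands in for RegexOptions.Singleline, which the question's behaviour suggests is switched on (an assumption, since the question doesn't show its options). The two-heading sample is a shortened, made-up stand-in for the original text, and the closing tag is required in both patterns so that only the quantifier differs.

```python
import re

# Shortened stand-in for the question's HTML (hypothetical sample text).
SPAN = "<span style='font:7.0pt \"Times New Roman\"'>"
html = (
    "<h5>(1)" + SPAN + "&nbsp;&nbsp; \n</span>First heading.</h5>\n\n"
    "<h5>(2)" + SPAN + "&nbsp;&nbsp; \n</span>Second heading.</h5>\n"
)

def build(filler):
    # `filler` is the quantified dot that sits just before </span>.
    # The unescaped dot in "7.0pt" is kept exactly as in the question.
    return re.compile(
        r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)"
        + SPAN
        + r"(?:&nbsp;)+" + filler + r"</span>(?P<Text>.*?)(?:</h[5-8]>)",
        re.DOTALL,  # assumed equivalent of RegexOptions.Singleline
    )

greedy = [m.groupdict() for m in build(r".+").finditer(html)]
lazy = [m.groupdict() for m in build(r".+?").finditer(html)]

# Greedy: one match that runs from the first <h5> to the last </h5>.
print(len(greedy), greedy[0]["ListIdentifier"], repr(greedy[0]["Text"]))
# Lazy: one match per heading.
print(len(lazy), [m["Text"] for m in lazy])
```

With the greedy form, .+ runs to the end of the input and backtracks only as far as the last </span>, which is why the single match swallows every heading in between.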

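Note that your character class should be [5678] rather than [5|6|7|8]: an OR between the characters of a class is already implied, so the | is treated as a literal character and the class matches it too. It can even be shortened to [5-8].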
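A quick way to see the effect of the stray | is to test both classes against a few candidate tag names. The sketch below is in Python for illustration only (character classes treat | the same way in .NET), and the candidate strings are made up for the demonstration.

```python
import re

as_written = re.compile(r"h[5|6|7|8]")  # class from the question
intended = re.compile(r"h[5-8]")        # class suggested in the answer

for candidate in ["h5", "h8", "h|", "h4"]:
    print(
        candidate,
        bool(as_written.fullmatch(candidate)),
        bool(intended.fullmatch(candidate)),
    )
# h5 True True
# h8 True True
# h| True False
# h4 False False
```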

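You should also remove the trailing ? from the end: (?:</h[5-8]>)? should be (?:</h[5-8]>). Without this change the closing tag is optional, so the lazy Text group is allowed to match nothing and your match will end before it should.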
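The effect of the optional closing tag can be seen on a single heading. As before, this is an illustrative Python sketch, not the .NET code itself: named groups use Python's (?P<Name>...) spelling, re.DOTALL is assumed to correspond to RegexOptions.Singleline, and the sample heading is a shortened, made-up stand-in.

```python
import re

SPAN = "<span style='font:7.0pt \"Times New Roman\"'>"
html = "<h5>(1)" + SPAN + "&nbsp;&nbsp; \n</span>First heading.</h5>\n"

# Everything up to and including the lazy Text group.
HEAD = (
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)"
    + SPAN
    + r"(?:&nbsp;)+.+?</span>(?P<Text>.*?)"
)

optional_close = re.compile(HEAD + r"(?:</h[5-8]>)?", re.DOTALL)
required_close = re.compile(HEAD + r"(?:</h[5-8]>)", re.DOTALL)

opt = optional_close.search(html)
req = required_close.search(html)

# Optional close: Text is empty and the match stops right after </span>.
print(repr(opt.group("Text")), opt.group(0).endswith("</span>"))
# Required close: Text has to stretch up to the closing tag.
print(repr(req.group("Text")), req.group(0).endswith("</h5>"))
```

A lazy group takes as little as it can get away with, so it only expands when something after it is required to match.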

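Edit: the reason the current regex matches the text in your edit is that the .*? in the ListIdentifier group will match straight across a </hX> if the span and &nbsp; have not been seen yet. You should be able to fix this by changing .*? to [^<]*, which will not match any less-than sign and will therefore require the span to come immediately after the list identifier.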

Result:

<(?<Class>h[5-8])>(?<ListIdentifier>[^<]*)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
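To check that [^<]* really skips the empty h6 headings, the two variants of the ListIdentifier group can be compared on input shaped like the text in the question's edit. This is a Python sketch under the same assumptions as any port from .NET: (?P<Name>...) replaces (?<Name>...), re.DOTALL is assumed to play the role of RegexOptions.Singleline, and the sample HTML is a shortened, hypothetical version of the original.

```python
import re

SPAN = "<span style='font:7.0pt \"Times New Roman\"'>"
html = (
    "<h6>&nbsp;</h6>\n\n"
    "<h6>&nbsp;</h6>\n\n"
    "<h5>(1)" + SPAN + "&nbsp;&nbsp; \n</span>Short Title.</h5>\n"
)

def build(list_identifier):
    # `list_identifier` is the body of the ListIdentifier group.
    return re.compile(
        r"<(?P<Class>h[5-8])>(?P<ListIdentifier>" + list_identifier + r")"
        + SPAN
        + r"(?:&nbsp;)+.+?</span>(?P<Text>.*?)(?:</h[5-8]>)",
        re.DOTALL,
    )

before = build(r".*?").search(html)   # pattern from the question's edit
after = build(r"[^<]*").search(html)  # pattern from the answer

# .*? starts at the first <h6> and drags the empty headings along.
print(before.start(), before.group("Class"))
# [^<]* cannot cross a tag, so the match starts at the <h5>.
print(after.group("Class"), after.group("ListIdentifier"), repr(after.group("Text")))
```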

Source

2012-02-08 23:06:51

You, sir, are a gentleman and a scholar! – 2012-02-08 23:16:15

What about also optionally matching

somerandomtext, which currently doesn't match

– 2012-02-09 03:55:54

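Could you edit your question to show the text that isn't matching as it should, and where each of the groups should stop? What I see in your edit wouldn't match as it stands. – 2012-02-09 04:39:33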
