Python regex with multiple optional capture groups


I am trying to capture multiple groups while ignoring case, and so far I have made no progress. The string I am checking can take several forms, for example:

<A title="Test title Ch.42" href="http://www.google.com">Test title Ch.42 </a> 
<A title="Test title Vol2. Ch.42" href="http://www.google.com">Test title Vol2. Ch.42 </a> 
<A title="Test title Vol2.Ch.42" href="http://www.google.com">Test title Vol2.Ch.42 </a> 
<A title="Test title \"with multiple quotes\" Ch.42" href="http://www.google.com">Test title "with multiple quotes" Ch.42 </a> 
<A title="Test title w1th numb3rs Ch.42" href="http://www.google.com">Test title w1th numb3rs Ch.42 </a> 
<A title="Test title no 42" href="http://www.google.com">Test title no 42 </a>

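So the general rules are:

The main title in the title attribute can contain any character, including digits and special characters
The URL is a standard URL and can be captured with a (.*) expression without any problem
The Ch. is usually optional
If the string contains Vol., the Ch. becomes mandatory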

My current regex looks like this:

pattern = re.compile('<A title="((.*)(?:Vol.[\d]+){0,1}(?: Ch.){0,1}([\d]+))" href="(.*)">')

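I want to try to capture:

The title attribute with volume and chapter, including the numbers after them
The title without Vol and Ch (and without the numbers after Vol and Ch)
The number after Ch.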

Would it be better to split the regex, and would that perform better (it runs over a few thousand strings, so I want to keep it fast)?

Kind regards, Baumchen

Source

2015-08-13 Baumchen

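Regex is not the best tool for parsing in this case; I think there are tools made exactly for that. However, with the given examples, you can try this: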

<a title="(.+?)\s?((Vol(\d+))?\s?\.?(Ch.(\d+)))?"\shref="(.+)">

DEMO

group(1) - the title
group(2) - volume with number and/or chapter with number
group(3) - volume with number
group(4) - the number only (Vol)
group(5) - chapter with number
group(6) - the number only (Ch)
group(7) - the URL
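As a rough illustration of how the groups above map onto the sample strings, here is a minimal Python sketch. It assumes the pattern is compiled with re.IGNORECASE (the samples use an uppercase <A while the pattern uses a lowercase <a), and the helper name parse and the dictionary keys are made up for this example.

```python
import re

# The pattern from the answer, compiled once so it can be reused across
# thousands of strings. IGNORECASE is an assumption: the samples start
# with "<A" while the pattern is written with "<a".
PATTERN = re.compile(
    r'<a title="(.+?)\s?((Vol(\d+))?\s?\.?(Ch.(\d+)))?"\shref="(.+)">',
    re.IGNORECASE,
)

def parse(line):
    """Return title, volume, chapter and url for one line, or None if no match.

    `parse` and the key names are hypothetical, chosen for this sketch.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "title": m.group(1),    # main title
        "volume": m.group(4),   # digits after Vol, or None
        "chapter": m.group(6),  # digits after Ch., or None
        "url": m.group(7),      # href value
    }

samples = [
    '<A title="Test title Ch.42" href="http://www.google.com">Test title Ch.42 </a>',
    '<A title="Test title Vol2.Ch.42" href="http://www.google.com">Test title Vol2.Ch.42 </a>',
    '<A title="Test title no 42" href="http://www.google.com">Test title no 42 </a>',
]

for line in samples:
    print(parse(line))
```

One thing to watch: for the sample written as "Vol2. Ch.42" (a dot followed by a space), this pattern seems to leave "Vol2." inside group(1), because it allows optional whitespace before the dot but not after it.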

Source

2015-08-13 09:19:28

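Thank you very much, it works perfectly. What kind of tool would you recommend for this case? – Baumchen

@Baumchen I'm glad it helped. I don't have a Python background, but I think any HTML parser could handle it, at least partially (maybe this one: https://docs.python.org/2/library/htmlparser.html) –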
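To make the HTML parser suggestion concrete, here is a small sketch using the standard library parser. The link in the comment points to the Python 2 HTMLParser module; in Python 3 the same class lives in html.parser, which is what this example assumes. The class name AnchorCollector is made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (title, href) pairs from every <a> start tag fed to the parser.

    AnchorCollector is a hypothetical name used only for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so "<A ...>"
        # arrives here as "a" without any case handling on our side.
        if tag == "a":
            attr_map = dict(attrs)
            self.anchors.append((attr_map.get("title"), attr_map.get("href")))

collector = AnchorCollector()
collector.feed(
    '<A title="Test title Vol2.Ch.42" href="http://www.google.com">'
    'Test title Vol2.Ch.42 </a>'
)
print(collector.anchors)
```

With the attributes pulled out by the parser, only the title text would still need a (much smaller) regex to separate the Vol and Ch parts from the main title.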
