Python regex: matching 0 or 1 repetitions


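I want to use a regular expression to match the HTML headings <h1> through <h6>. Some headings have an 'id' attribute, and I want to capture it in a group.

With the following expression I get a single id attribute.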

>>> re.findall(r'<h[1-6].*?(id=\".*?\").*?</h[1-6].*?>', '<h1>Header1</h1><h2 id="header2">header2</h2>')
['id="header2"']

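The question mark causes the resulting RE to match 0 or 1 repetitions of the preceding RE. If I put a ? after the closing parenthesis, it returns two empty strings.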

>>> re.findall(r'<h[1-6].*?(id=\".*?\")?.*?</h[1-6].*?>', '<h1>Header1</h1><h2 id="header2">header2</h2>')
['', '']
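To see why the two patterns behave this way, it can help to print what each match actually spans. The sketch below uses `re.finditer` on the same input; the variable names (`html`, `required`, `optional`) are illustrative and not part of the original question.

```python
import re

html = '<h1>Header1</h1><h2 id="header2">header2</h2>'

# Pattern 1: the group is mandatory, so the lazy .*? keeps expanding
# until it finds id="...", which swallows the whole <h1> element.
required = re.compile(r'<h[1-6].*?(id=\".*?\").*?</h[1-6].*?>')

# Pattern 2: the group is optional. The lazy .*? before it first matches
# nothing, the group is tried right after "<h1" / "<h2", fails, and is skipped.
optional = re.compile(r'<h[1-6].*?(id=\".*?\")?.*?</h[1-6].*?>')

required_matches = [(m.group(0), m.group(1)) for m in required.finditer(html)]
optional_matches = [(m.group(0), m.group(1)) for m in optional.finditer(html)]

for whole, group in required_matches + optional_matches:
    print(repr(whole), repr(group))
```

The first pattern yields one match covering both headings. The second yields one match per heading, but its group never participates, and `re.findall` reports a non-participating group as an empty string.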

How can I get the following result with a single regular expression?

['', 'id="header2"']

Source

2013-08-19 aaron cheung

Try removing the question mark in your second regex, `(id=\".*?\")`. – Jerry

First read [this](http://stackoverflow.com/a/1732454/2199958), then use [BeautifulSoup](https://pypi.python.org/pypi/BeautifulSoup) :) –

Then it is the same as the first regex, and the output is `['id="header2"']` –

You are using the wrong tool. Don't use regular expressions to parse HTML. Use an HTML parser instead.

The BeautifulSoup library makes your task simple:

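from bs4 import BeautifulSoup

# htmlsource is the HTML document as a string
soup = BeautifulSoup(htmlsource, 'html.parser')

# collect every heading, then read its id (empty string if absent)
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([h.attrs.get('id', '') for h in headers])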

Demo:

>>> from bs4 import BeautifulSoup 
>>> htmlsource = '<h1>Header1</h1><h2 id="header2">header2</h2>' 
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']) 
>>> [h.attrs.get('id', '') for h in headers] 
['', 'header2']
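If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is a minimal illustration, not part of the original answer; the class name `HeadingIdCollector` is hypothetical.

```python
from html.parser import HTMLParser


class HeadingIdCollector(HTMLParser):
    """Collect the id attribute of every <h1>..<h6> start tag ('' if missing)."""

    HEADINGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag in self.HEADINGS:
            self.ids.append(dict(attrs).get('id') or '')


collector = HeadingIdCollector()
collector.feed('<h1>Header1</h1><h2 id="header2">header2</h2>')
collector.close()
print(collector.ids)
```

For the sample input this prints `['', 'header2']`, matching the BeautifulSoup demo.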

Source

2013-08-19 13:00:01

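The lazy `.*?` in front of the group does not consume the space before `id`, so the optional group is tried at the space, fails, and is skipped. You need to include the whitespace explicitly. One possibility is: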

>>> re.findall(r'<h[1-6].*?( +id=\".*?\" ?)?.*?</h[1-6].*?>', '<h1>Header1</h1><h2 id="header2">header2</h2>')
['', ' id="header2"']
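If the leading space in the captured group is unwanted, one option is to keep the whitespace outside the capture by wrapping it in a non-capturing optional group. The following is only a sketch: it assumes attribute values are double-quoted and contain no `>`, and the names `HEADING_RE` and `html` are illustrative.

```python
import re

# (?: ... )? is an optional non-capturing group: it consumes any attributes
# and the whitespace before id, while only id="..." lands in group 1.
# [^>]* keeps the search inside the opening tag.
HEADING_RE = re.compile(r'<h[1-6](?:[^>]*?\s(id="[^"]*"))?[^>]*>.*?</h[1-6]>')

html = '<h1>Header1</h1><h2 id="header2">header2</h2>'
result = HEADING_RE.findall(html)
print(result)
```

On the sample input this gives `['', 'id="header2"']`, the result the question asks for. It remains fragile on real-world HTML, so an HTML parser is still the safer choice.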

Source

2013-08-19 13:26:01 DeltaKappa
