Python regex: extracting the contents of HTML elements

I have HTML page elements in this format:

<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite 
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a 
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event 
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg- 
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td 
class="cell7">Philadelphia</td>

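I want to use Python to extract the "Dave Mason's Traffic Jam" part, the "Scottish Rite Auditorium" part, and so on. Using the regular expression '.*' returns everything from the first tag to the last tag before the next newline. How can I change the expression so that it returns only the chunk between each pair of tags?

Edit: @HenryKeiter & @Hakiko, that would be grand, but this is for an assignment that requires me to use Python regex.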

Source

2014-05-10 Swanijam

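Use a real HTML parser like [BeautifulSoup](http://beautiful-soup-4.readthedocs.org/en/latest/). Don't try to parse HTML with regular expressions. [That way lies madness.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) –

I think extracting your content with regex will be a headache. Use an HTML parser. – hakiko

Re: your edit: if you have to use regex, you need to define more precisely what you want to extract. All the contents of all the cells? Only certain ones? Work out how to describe what you need to match, and the rest is just learning the [regex syntax](https://docs.python.org/2/library/re.html#regular-expression-syntax). Hint: once you have figured out what the boundaries should be, '.*' will probably sit in the middle of what you want. –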

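This is a hint rather than a complete solution: in your case you need a non-greedy regular expression. Basically, you need to use

.*?

instead of

.*

Non-greedy means the shortest possible match is taken. The default is greedy, which takes the longest possible match.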
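To make the difference concrete, here is a minimal sketch. The two-cell string `row` and the surrounding `<td[^>]*>` / `</td>` boundaries are assumptions chosen for illustration; they are not the asker's actual pattern.

```python
import re

# Simplified two-cell sample, assumed for illustration only.
row = '<td class="cell1">first</td><td class="cell2">second</td>'

# Greedy: .* runs to the LAST </td>, swallowing everything in between.
greedy = re.findall(r'<td[^>]*>(.*)</td>', row)

# Non-greedy: .*? stops at the FIRST </td> after each opening tag.
lazy = re.findall(r'<td[^>]*>(.*?)</td>', row)

print(greedy)  # ['first</td><td class="cell2">second']
print(lazy)    # ['first', 'second']
```

The greedy version returns a single match spanning both cells, which is the behaviour described in the question. The non-greedy version returns one match per cell.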

Source

2014-05-10 22:38:13

Using Beautiful Soup:

from bs4 import BeautifulSoup 

html = ''' 
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite 
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a 
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event 
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg- 
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td 
class="cell7">Philadelphia</td> 
'''.strip() 

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# Collect the text of every <td> cell, with nested tags stripped
content_list = [td.get_text() for td in soup.find_all('td')]
print(content_list)

["Dave Mason's Traffic Jam", 'Scottish Rite\nAuditorium', '$29-$45', 'On sale now', 'TIX', 'AA', 'Philadelphia']
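For comparison, if a parser is ruled out (as in the asker's assignment), roughly the same list can be produced with `re` alone. This is only a sketch under stated assumptions: the sample is shortened to the first three cells, the helper name `cell_texts` is hypothetical, and the pattern assumes cells are not nested and that no attribute value contains a `>` character. It is not a general HTML parser.

```python
import re

# Shortened sample (first three cells only), assumed for illustration.
html = ('<td class="cell1"><b>Dave Mason\'s Traffic Jam</b></td>'
        '<td class="cell2">Scottish Rite\nAuditorium</td>'
        '<td class="cell3">$29-$45</td>')

# re.DOTALL lets . match newlines, since a cell may wrap across lines.
CELL = re.compile(r'<td[^>]*>(.*?)</td>', re.DOTALL)
# Matches any nested tag such as <b> or <a href=...> so it can be removed.
TAG = re.compile(r'<[^>]+>')

def cell_texts(markup):
    """Return the text of each <td> cell with nested tags removed."""
    return [TAG.sub('', inner) for inner in CELL.findall(markup)]

print(cell_texts(html))
# ["Dave Mason's Traffic Jam", 'Scottish Rite\nAuditorium', '$29-$45']
```

The non-greedy `.*?` keeps each match inside one cell, and the second pass strips inner markup, which is what `get_text()` does for the parsed version.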

Source

2014-05-10 22:53:42
