Extracting multi-line data from an HTML website using Python

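As long as what I'm matching doesn't span more than one line, I'm fine; as soon as it spans more than one line, I get heartburn (so to speak)... Here is a snippet of the HTML data I'm getting: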

<tr> 
<td width=20%>3 month 
<td width=1% class=bar> 
&nbsp; 
<td width=1% nowrap class="value chg">+10.03% 
<td width=54% class=bar> 
<table width=100% cellpadding=0 cellspacing=0 class=barChart> 
<tr>

What I'm interested in is the "+10.03%" number, and

<td width=20%>3 month

is what tells me that this "+10.03%" is the one I want.

So this is what I've got so far in Python:

percent = re.search('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n<td width=1% nowrap class="value chg">(.*?)', content)

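where the variable content holds all of the HTML I'm searching through. This doesn't seem to work for me... any advice would be greatly appreciated! I've read a few other posts that talk about re.compile() and re.multiline(), but I haven't had any luck with them, mostly because I don't understand how they work, I guess...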

Source

2013-10-09 skbeez

Don't use regular expressions to parse HTML. It always ends in heartache. – tehsockz

Don't use regular expressions to parse HTML. It's a bad idea, because it can get complicated very quickly. Use something like HTMLParser (http://docs.python.org/2/library/htmlparser.html).

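So I tried HTMLParser, but BeautifulSoup seems to work better... (HTMLParser returned a bad-tag error), but I'm a bit confused about how to get it to search for my 10.03% number.. I searched – skbeez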

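Thanks, everyone, for the help! You pointed me in the right direction, and this is how I got my code working with BeautifulSoup. I noticed that all the data I wanted sits in cells with the class "value chg", and that my data was always the 3rd and 5th elements of the search results, so this is what I did: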

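# Python 2 code using BeautifulSoup 3 (the old "BeautifulSoup" package)
from BeautifulSoup import BeautifulSoup
import urllib

# url is assumed to be defined elsewhere as the address of the page
content = urllib.urlopen(url).read()
soup = BeautifulSoup(''.join(content))

# every <td> whose class attribute is "value chg"
td_list = soup.findAll('td', {'class':'value chg'})

# the 3rd and 5th matches hold the 3-month and 1-year figures
mon3 = td_list[2].text.encode('ascii','ignore')
yr1 = td_list[4].text.encode('ascii','ignore')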

Again, "content" is the HTML I downloaded.
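The snippet above relies on Python 2's urllib.urlopen and the old BeautifulSoup 3 import. For anyone on Python 3 without BeautifulSoup installed, here is a rough sketch of the same idea (collect the text of every td with class "value chg") using only the standard library's html.parser. This is an alternative approach, not the code I actually ran; ValueChgCollector, extract_value_chg, and SAMPLE_HTML are hypothetical names made up for this example, and the sample markup is just the fragment from the question.

```python
from html.parser import HTMLParser

# Sample markup copied from the question; note the unclosed <td> tags.
SAMPLE_HTML = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n'
    '<td width=54% class=bar>\n'
)


class ValueChgCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each <td class="value chg">."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # Any new tag ends the previous cell (the page never closes its <td>s);
        # capture restarts only for a <td> whose class is exactly "value chg".
        self._capturing = tag == "td" and dict(attrs).get("class") == "value chg"
        if self._capturing:
            self.values.append("")

    def handle_endtag(self, tag):
        # An explicit closing tag also ends the current cell.
        self._capturing = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._capturing:
            self.values[-1] += data


def extract_value_chg(html):
    """Return the stripped text of every "value chg" cell, in document order."""
    parser = ValueChgCollector()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


print(extract_value_chg(SAMPLE_HTML))  # prints ['+10.03%']
```

On the full page, indexes 2 and 4 of the returned list should line up with td_list[2] and td_list[4], assuming the cells come out in the same order. One limitation of this sketch: a tag nested inside the cell (a bold tag, say) would cut the captured text short.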

Source

2013-10-09 07:15:12 skbeez

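Add the "multiline" regex switch (?m) (strictly speaking, it only changes how ^ and $ match, so it is optional for this pattern). You can extract the target content directly with findall(regex, content)[0], that is, by calling findall and taking the first element of the result: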

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

Because \s* is used to match the line breaks, the regex works with both UNIX-style and Windows-style line terminators.
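To make the line-ending point concrete, here is a small check, offered as a sketch rather than anything definitive. It runs the pattern against the question's fragment once with \n and once with \r\n line endings, both with and without the (?m) prefix. The names PATTERN, unix_html, windows_html, and results are made up for this example. As far as the re documentation goes, (?m) only affects ^ and $, so the two variants are expected to agree here.

```python
import re

# Same pattern as above, without the (?m) prefix.
PATTERN = (
    r'<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)

# The fragment from the question with UNIX line endings...
unix_html = (
    '<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n'
)
# ...and the same fragment with Windows line endings.
windows_html = unix_html.replace('\n', '\r\n')

results = {}
for label, html in (('unix', unix_html), ('windows', windows_html)):
    without_flag = re.findall(PATTERN, html)
    with_flag = re.findall('(?m)' + PATTERN, html)
    results[label] = (without_flag, with_flag)
    print(label, without_flag, with_flag)
```

Both lines of output should show ['+10.03%'] twice, since \s* absorbs \n and \r\n alike and (\S+) stops at the first whitespace character after the number.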

See the test code from the live demo below:

import re 
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'   
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0] 
print(percent)

Output:

+10.03%

Source

2013-10-09 13:38:19 Bohemian
