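How do I find a table after a text string using BeautifulSoup in Python?

I want to extract data from several web pages that are not uniform in how they display their tables. I need to write code that searches for a text string and then goes to the table immediately following that string. Then I want to extract the contents of that table. Here is what I have so far: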
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print
However, I get the following error:
soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'
Is there a simple way to do this using BeautifulSoup?
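
For context, the error arises because findAll returns a ResultSet (a list-like collection of matches), and the navigation methods such as findAllNext live on the individual elements, not on the collection. One possible approach is to find a single matching text node and ask it for the next table. The sketch below is only an illustration and makes several assumptions that are not in the code above: it uses the newer bs4 package (with find, find_next and find_all) rather than BeautifulSoup 3, it runs on Python 3, it parses with the built-in "html.parser", and the helper name table_after is hypothetical.

```python
import re
from bs4 import BeautifulSoup  # assumption: the bs4 package, not BeautifulSoup 3

html = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>'
    '<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr>'
    '<tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table>'
    '</html>'
)


def table_after(soup, label):
    """Return the rows of the first table after `label`, or None if not found.

    `table_after` is a hypothetical helper written for this example.
    """
    # Join the words of the label with \s* so that the amount of whitespace
    # between them does not matter; IGNORECASE makes the match case-insensitive.
    words = [re.escape(word) for word in label.split()]
    pattern = re.compile(r"\s*".join(words), re.IGNORECASE)

    # find() returns a single text node (or None), not a ResultSet.
    node = soup.find(string=pattern)
    if node is None:
        return None

    # find_next() walks forward in document order from that node.
    table = node.find_next("table")
    if table is None:
        return None

    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(html, "html.parser")
rows = table_after(soup, "Table 1")
if rows is not None:
    for row in rows:
        print("|".join(row))
```

Note that the pattern is a substring search, so a label such as "Table 1" would also match the text "Table 10"; anchoring the pattern would be needed if that matters for the pages in question.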