How do I find a table after a text string using BeautifulSoup in Python?

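I am trying to extract data from several web pages that are not uniform in how they display their tables. I need to write code that searches for a text string and then goes to the table immediately following that string, so that I can extract the table's contents. Here is what I have so far: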

from BeautifulSoup import BeautifulSoup, SoupStrainer 
import re 

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'] 
soup = BeautifulSoup(''.join(html)) 
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space 
foundtext = soup.findAll('p',text=searchtext) 
soupafter = foundtext.findAllNext() 
table = soupafter.find('table') # find the next table after the search string is found 
rows = table.findAll('tr') 
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

However, I get the following error:

soupafter = foundtext.findAllNext() 
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there a simple way to do this with BeautifulSoup?

2011-04-19 Josh Lee

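The error occurs because findAllNext is a method of Tag objects, whereas foundtext is a ResultSet object, which is a list of matching tags or strings. You can iterate over each tag in foundtext, but depending on your needs, find may be enough, since it returns only the first matching tag.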
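
To illustrate the difference, here is a minimal sketch. It assumes Python 3 and the current `bs4` package rather than the legacy BeautifulSoup 3 module used in this thread, so the methods are spelled `find_all` and `find_next`. The sample HTML and variable names are made up for the illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document with two captioned tables.
html = (
    "<html><body>"
    "<p><b>Table 1</b></p><table><tr><td>a</td></tr></table>"
    "<p><b>Table 2</b></p><table><tr><td>b</td></tr></table>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Table\s+\d", re.IGNORECASE)

# find_all returns a ResultSet (a list of matches), so navigation
# methods have to be called on each element, not on the list itself.
matches = soup.find_all(string=pattern)
first_cells = []
for match in matches:
    table = match.find_next("table")  # first <table> after this match
    first_cells.append(table.td.get_text())

# find returns only the first match (or None), so it can be used directly.
first_match = soup.find(string=pattern)
first_table = first_match.find_next("table")
print(len(matches), first_cells, first_table.td.get_text())
```

Looping over the ResultSet is the option to reach for when every matching heading on the page matters; `find` is enough when only the first one does.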

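Here is a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also modified your regular expression to tolerate any run of whitespace between the words: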

from BeautifulSoup import BeautifulSoup, SoupStrainer 
import re 

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'] 
soup = BeautifulSoup(''.join(html)) 
searchtext = re.compile(r'Table\s+1',re.IGNORECASE) 
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text 
table = foundtext.findNext('table') # Find the first <table> tag that follows it 
rows = table.findAll('tr') 
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

This outputs:

1. row 1, cell 1| 1. row 1, cell 2| 
1. row 2, cell 1| 1. row 2, cell 2|
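
The code above is written for Python 2 and the legacy BeautifulSoup 3 module. For readers on a newer setup, the same idea might be sketched as follows, assuming Python 3 and the `bs4` package. The helper name `table_after` is hypothetical, and the sample HTML is a shortened variant of the one in the question, with the second heading written as "table   2" to exercise the case-insensitive, whitespace-tolerant pattern.

```python
import re
from bs4 import BeautifulSoup

def table_after(soup, pattern):
    """Return the rows (lists of cell text) of the first table after matching text."""
    heading = soup.find(string=pattern)   # first text node matching the regex
    if heading is None:
        return []
    table = heading.find_next("table")    # first <table> later in the document
    if table is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]

html = (
    '<html><body><p align="center"><b><font size="2">Table 1</font></b>'
    "<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>"
    "<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>"
    '<p align="center"><b><font size="2">table   2</font></b>'
    "<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr></table>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
rows = table_after(soup, re.compile(r"Table\s+2", re.IGNORECASE))
for row in rows:
    print("|".join(row))
```

Searching with `string=` and then calling `find_next` on the matched text node avoids depending on how a particular parser nests the unclosed `<p>` tags. Returning an empty list when nothing matches is a design choice of this sketch, not something the original answer does.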

2011-04-19 06:17:12
