2011-04-19

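How do I find a table after a text string using BeautifulSoup in Python?

I want to extract data from several web pages that are not consistent in how they display their tables. I need to write code that searches for a text string and then goes to the table immediately following that string, so that I can extract the table's contents. Here is what I have so far: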

from BeautifulSoup import BeautifulSoup, SoupStrainer 
import re 

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'] 
soup = BeautifulSoup(''.join(html)) 
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space 
foundtext = soup.findAll('p',text=searchtext) 
soupafter = foundtext.findAllNext() 
table = soupafter.find('table') # find the next table after the search string is found 
rows = table.findAll('tr') 
for tr in rows: 
    cols = tr.findAll('td') 
    for td in cols: 
        try: 
            text = ''.join(td.find(text=True)) 
        except Exception: 
            text = "" 
        print text+"|", 
print 

However, I get the following error:

soupafter = foundtext.findAllNext() 
AttributeError: 'ResultSet' object has no attribute 'findAllNext' 

Is there a simple way to do this using BeautifulSoup?

Answer

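The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate over each item in foundtext, but depending on your needs, find may be sufficient, since it returns only the first match.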
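The difference between the two return types can be sketched as follows. This is only an illustration, and it assumes the newer bs4 package on Python 3 (where findAll and findNext are spelled find_all and find_next) rather than the BeautifulSoup 3 import used in the question. The sample markup and variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: two captions, each followed by a table.
html = (
    '<p><b>Table 1</b></p><table><tr><td>a</td></tr></table>'
    '<p><b>Table 2</b></p><table><tr><td>b</td></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a ResultSet (a list subclass), so navigation methods
# must be called on its elements, not on the list itself.
matches = soup.find_all(string=re.compile(r'Table\s+\d+', re.IGNORECASE))

first_cells = []
for match in matches:
    table = match.find_next('table')  # first <table> after this string
    first_cells.append(table.find('td').get_text())

print(first_cells)
```

Looping like this is useful when every caption on the page matters; when only the first one does, a single find call avoids the loop entirely.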

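Here is a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also modified your regular expression to ignore extra whitespace between the words (\s+ matches one or more whitespace characters).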

from BeautifulSoup import BeautifulSoup, SoupStrainer 
import re 

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'] 
soup = BeautifulSoup(''.join(html)) 
searchtext = re.compile(r'Table\s+1',re.IGNORECASE) 
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text 
table = foundtext.findNext('table') # Find the first <table> tag that follows it 
rows = table.findAll('tr') 
for tr in rows: 
    cols = tr.findAll('td') 
    for td in cols: 
        try: 
            text = ''.join(td.find(text=True)) 
        except Exception: 
            text = "" 
        print text+"|", 
    print 

This outputs:

1. row 1, cell 1| 1. row 1, cell 2| 
1. row 2, cell 1| 1. row 2, cell 2| 
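
For readers on Python 3, the same approach might look roughly like the sketch below. It assumes the bs4 package and its built-in 'html.parser' backend, neither of which appears in the original answer. The helper name table_after_text and its return shape (a list of rows, each a list of cell strings) are hypothetical choices for illustration.

```python
import re
from bs4 import BeautifulSoup


def table_after_text(html, pattern):
    """Return the rows of the first <table> that follows text matching pattern.

    Hypothetical helper: returns a list of rows, each a list of cell strings,
    or an empty list when no matching text or no following table exists.
    """
    soup = BeautifulSoup(html, 'html.parser')
    found = soup.find(string=pattern)  # first matching string, or None
    if found is None:
        return []
    table = found.find_next('table')  # first <table> after that string
    if table is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]


html = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table  2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr></table>'
    '</body></html>'
)

# \s+ accepts one or more whitespace characters between the words.
rows = table_after_text(html, re.compile(r'Table\s+2', re.IGNORECASE))
for row in rows:
    print('|'.join(row))
```

Returning an empty list when nothing matches avoids the AttributeError that would otherwise arise from calling a method on None, which matters when the pages being scraped are not uniform.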