How do I get all the rows from a particular table using BeautifulSoup?

I am learning Python and BeautifulSoup to scrape data from the web, and I want to read an HTML table. I can read the page into Open Office, which reports the table as Table #11.

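BeautifulSoup seems to be the preferred choice, but can anyone tell me how to grab a particular table and all of its rows? I have looked at the module documentation, but I can't get my head around it. Many of the examples I have found online seem to do more than I need.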

2010-01-06 Btibert3

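If you have a chunk of HTML to parse with BeautifulSoup, this should be pretty straightforward. The general idea is to navigate to your table using the findChildren method, then read the text inside each cell using the string property.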

>>> from BeautifulSoup import BeautifulSoup 
>>> 
>>> html = """ 
... <html> 
... <body> 
...  <table> 
...   <th><td>column 1</td><td>column 2</td></th> 
...   <tr><td>value 1</td><td>value 2</td></tr> 
...  </table> 
... </body> 
... </html> 
... """ 
>>> 
>>> soup = BeautifulSoup(html) 
>>> tables = soup.findChildren('table') 
>>> 
>>> # This will get the first (and only) table. Your page may have more. 
>>> my_table = tables[0] 
>>> 
>>> # You can find children with multiple tags by passing a list of strings 
>>> rows = my_table.findChildren(['th', 'tr']) 
>>> 
>>> for row in rows: 
...  cells = row.findChildren('td') 
...  for cell in cells: 
...   value = cell.string 
...   print "The value in this cell is %s" % value 
... 
The value in this cell is column 1 
The value in this cell is column 2 
The value in this cell is value 1 
The value in this cell is value 2 
>>>
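
For anyone on Python 3 with the newer bs4 package, here is a rough equivalent of the same idea. It is a sketch rather than the original answer's code: table_rows and its table_index parameter are names made up for illustration, and the sample markup is adjusted so that header cells are th elements inside a tr, which is my assumption about what was intended. If Open Office numbers tables from 1 in document order, Table #11 would probably be table_index=10, but that is worth checking against your page.

```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
 <table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
 </table>
</body>
</html>
"""


def table_rows(markup, table_index=0):
    """Return the cell text of every row in the table at table_index.

    table_rows and table_index are hypothetical names for this sketch.
    """
    soup = BeautifulSoup(markup, "html.parser")
    tables = soup.find_all("table")
    table = tables[table_index]  # IndexError if the page has fewer tables
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


for row in table_rows(html):
    print(row)
```

Note that find_all searches recursively, so if the chosen table contains nested tables, their rows are returned as well.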


2010-01-06 02:03:25

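That does the trick! The code works, and I should be able to modify it as needed. Many thanks. One last question: I can follow the code except for the part where you search the table for the children th and tr. Is that simply searching my table and returning both the table header and the table rows? If I only wanted the table rows, could I search for just tr? Many thanks again! – Btibert3 2010-01-06 02:19:18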

Yes, `.findChildren(['th', 'tr'])` searches for elements whose tag type is `th` or `tr`. If you only want to find `tr` elements, use `.findChildren('tr')` (note: not a list, just the string). – 2010-01-08 22:15:51
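
To make the string-versus-list distinction concrete, here is a small sketch. It assumes Python 3 and the bs4 package, uses find_all (bs4's spelling of the same search), and runs against made-up sample markup in which the header cells are th elements inside a tr.

```python
from bs4 import BeautifulSoup

html = """
<table>
 <tr><th>column 1</th><th>column 2</th></tr>
 <tr><td>value 1</td><td>value 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# A single string matches one tag name: every <tr> row.
rows = table.find_all("tr")

# A list matches any of the listed names: header and data cells alike.
cells = table.find_all(["th", "td"])

# Keep only rows that contain at least one <td>, which skips the
# header row in this sample.
data_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in rows
    if tr.find("td") is not None
]

print(len(rows), len(cells), data_rows)
```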

It's worth noting that [PyQuery](https://pythonhosted.org/pyquery/api.html) is a very good alternative to BeautifulSoup. – 2014-06-27 15:31:14
