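Extracting table contents from HTML with Python and BeautifulSoup

I want to extract certain information from an HTML document. For example, it contains a table like this (among other tables with other contents):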

<table class="details"> 
      <tr> 
        <th>Advisory:</th> 
        <td>RHBA-2013:0947-1</td> 
      </tr> 
      <tr>  
        <th>Type:</th> 
        <td>Bug Fix Advisory</td> 
      </tr> 
      <tr> 
        <th>Severity:</th> 
        <td>N/A</td> 
      </tr> 
      <tr>  
        <th>Issued on:</th> 
        <td>2013-06-13</td> 
      </tr> 
      <tr>  
        <th>Last updated on:</th> 
        <td>2013-06-13</td> 
      </tr> 

      <tr> 
        <th valign="top">Affected Products:</th> 
        <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td> 
      </tr> 


    </table>

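I want to extract information such as the date given under "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I have not managed to get it right. My code so far: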

from bs4 import BeautifulSoup
soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
table_tag=soup.table
if table_tag['class'] == ['details']:
    print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
    a=table_tag.next_sibling
    print unicode(a)
    print table_tag.contents

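This gets me the contents of the first table row, and also a listing of the contents. But the next_sibling part does not work right; I guess I am just using it wrong. Of course I could parse the contents list myself, but it seems to me that BeautifulSoup was designed to save us from doing exactly that (if I start parsing by hand, I might as well parse the whole document...). If someone could enlighten me on how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.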

Source

2013-06-19 Isaac

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc) 
>>> table = soup.find('table', {'class': 'details'}) 
>>> th = table.find('th', text='Issued on:') 
>>> th 
<th>Issued on:</th> 
>>> td = th.findNext('td') 
>>> td 
<td>2013-06-13</td> 
>>> td.text 
u'2013-06-13'
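
The session above looks up a single field. If every row of the table is wanted, the same idea can be sketched as a loop over the rows, pairing each `th` with the `td` in the same `tr`. This is only a sketch under assumptions: it uses Python 3 and the built-in `html.parser`, the `HTML` string is a shortened stand-in for the real document, and `table_to_dict` is a hypothetical helper name, not part of BeautifulSoup. It also sidesteps the two problems in the question: `soup.table` returns only the first table in the document, and `.next_sibling` frequently returns the whitespace text between tags rather than the next tag.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real document (assumption): one unrelated
# table followed by an abbreviated copy of the "details" table.
HTML = """
<p>other content</p>
<table class="other"><tr><th>Issued on:</th><td>1999-01-01</td></tr></table>
<table class="details">
  <tr>
    <th>Advisory:</th>
    <td>RHBA-2013:0947-1</td>
  </tr>
  <tr>
    <th>Issued on:</th>
    <td>2013-06-13</td>
  </tr>
</table>
"""


def table_to_dict(html, css_class="details"):
    """Return {header text: cell text} for the first table with css_class.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Search by class instead of taking soup.table, which is just the
    # first <table> in the document and may not be the one we want.
    table = soup.find("table", class_=css_class)
    if table is None:
        return {}
    fields = {}
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # skip rows that are not a header/value pair
        fields[th.get_text(strip=True)] = td.get_text(strip=True)
    return fields


details = table_to_dict(HTML)
print(details["Issued on:"])  # prints 2013-06-13
```

Searching within each `tr` avoids sibling navigation entirely, so stray whitespace nodes between tags do not matter.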

Source

2013-06-19 16:43:55 falsetru
