Help parsing this HTML table with BeautifulSoup and lxml in a Pythonic way

I have searched a lot about BeautifulSoup, and some people suggest lxml as the future of BeautifulSoup. That makes sense, but I am having a tough time parsing the table below out of the whole list of tables on the web page.

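I am interested in the three columns. The number of rows varies depending on the page and the time it is checked. Both a BeautifulSoup and an lxml solution would be much appreciated, so that I can ask the admin to install lxml on the dev machine.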

Desired output:

Website                   Last Visited   Last Loaded
http://google.com         01/14/2011
http://stackoverflow.com  01/10/2011
...... more if present

Here is a code sample from the messy web page:

<table border="2" width="100%"> 
    <tbody><tr> 
    <td width="33%" class="BoldTD">Website</td> 
    <td width="33%" class="BoldTD">Last Visited</td> 
    <td width="34%" class="BoldTD">Last Loaded</td> 
    </tr> 
    <tr> 
    <td width="33%"> 
     <a href="http://google.com"</a> 
    </td> 
    <td width="33%">01/14/2011 
      </td> 
    <td width="34%"> 
      </td> 
    </tr> 
    <tr> 
    <td width="33%"> 
     <a href="http://stackoverflow.com"</a> 
    </td> 
    <td width="33%">01/10/2011 
      </td> 
    <td width="34%"> 
      </td> 
    </tr> 
</tbody></table>

2011-01-21 ThinkCode

What result do you want? A dictionary entry with the site name and date? What is the source of the HTML? Is it under your control? – Spaceghost 2011-01-21 17:33:17

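Unfortunately, the source of the HTML is not under my control. A dictionary entry would work just fine. As shown in the "Desired output" section, the number of rows will vary. There is no class associated with the table, so if a table has "Website" in its content, that is the one whose data we grab. – ThinkCode 2011-01-21 17:37:53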

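Here is a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tags and the doctype declaration, both of which defeated the ElementTree version.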

from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2


class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.line = ""          # text collected for the current row
        self.in_tr = False      # inside a <tr> of the wanted table
        self.in_table = False   # set once the "Website" header cell is seen

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == "tr":
            self.line = ""
            self.in_tr = True
        if self.in_tr and tag == "a":
            # only collect links inside rows of the wanted table;
            # the URL lives in the href attribute, not in the text
            for name, value in attrs:
                if name == "href" and value:
                    self.line += value + " "

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_tr = False
            if self.line:
                print(self.line)
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if data == "Website":
            self.in_table = True
        elif self.in_tr:
            data = data.strip()
            if data:
                self.line += data + " "


if __name__ == "__main__":
    myp = MyParser()
    with open("table.html") as f:
        myp.feed(f.read())

Hopefully this addresses everything you need and you can accept this answer. Updated as requested.
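Since the asker mentioned that dictionary entries would work, the same event-driven idea could collect rows instead of printing them. The following is only a sketch, assuming Python 3's standard `html.parser`; the names `TableCollector`, `parse_table`, and `SAMPLE_HTML` are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser

# Shortened copy of the table from the question, broken <a> tags included.
SAMPLE_HTML = """
<table border="2" width="100%">
  <tbody><tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"</a></td>
    <td>01/14/2011
    </td>
    <td>
    </td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"</a></td>
    <td>01/10/2011
    </td>
    <td>
    </td>
  </tr>
</tbody></table>
"""


class TableCollector(HTMLParser):
    """Hypothetical helper: gathers the cell text of every table, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self.row = None    # cells of the <tr> being read, or None
        self.cell = None   # text pieces of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.row = []
        elif tag == "td" and self.row is not None:
            self.cell = []
        elif tag == "a" and self.cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.cell.append(href)

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            # join the non-empty pieces; an empty cell becomes ""
            self.row.append(" ".join(p for p in self.cell if p))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.tables[-1].append(self.row)
            self.row = None
            self.cell = None


def parse_table(markup, marker="Website"):
    """Return the rows of the first table whose first header cell equals marker."""
    parser = TableCollector()
    parser.feed(markup)
    parser.close()
    for rows in parser.tables:
        if rows and rows[0] and rows[0][0] == marker:
            header, body = rows[0], rows[1:]
            return [dict(zip(header, row)) for row in body]
    return []


if __name__ == "__main__":
    for entry in parse_table(SAMPLE_HTML):
        print(entry)
```

Each row comes back as a dictionary keyed by the header cells, so an empty "Last Loaded" cell shows up as an empty string rather than being dropped.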

2011-01-25 21:47:13 Spaceghost

>>> from lxml import html 
>>> table_html = """
...   <table border="2" width="100%"> 
...      <tbody><tr> 
...       <td width="33%" class="BoldTD">Website</td> 
...       <td width="33%" class="BoldTD">Last Visited</td> 
...       <td width="34%" class="BoldTD">Last Loaded</td> 
...      </tr> 
...      <tr> 
...       <td width="33%"> 
...       <a href="http://google.com"</a> 
...       </td> 
...       <td width="33%">01/14/2011 
...         </td> 
...       <td width="34%"> 
...         </td> 
...      </tr> 
...      <tr> 
...       <td width="33%"> 
...       <a href="http://stackoverflow.com"</a> 
...       </td> 
...       <td width="33%">01/10/2011 
...         </td> 
...       <td width="34%"> 
...         </td> 
...      </tr> 
...      </tbody></table>""" 
>>> table = html.fromstring(table_html) 
>>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'): 
...  for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'): 
...    print(column.strip(), end=" ")
...  print()
... 
Website Last Visited Last Loaded 
http://google.com 01/14/2011 
http://stackoverflow.com 01/10/2011 
>>>

Voilà ;) Of course, instead of printing you can add the values to nested lists or dicts ;)
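As a rough illustration of collecting the values into dicts rather than printing them, here is one possible sketch. It assumes lxml is installed; `table_to_dicts` and `SAMPLE_HTML` are hypothetical names, and the table is picked by its "Website" header cell (as the asker described) instead of by its `border` and `width` attributes.

```python
from lxml import html

# Shortened copy of the table from the question, broken <a> tags included.
SAMPLE_HTML = """
<table border="2" width="100%">
  <tbody><tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"</a></td>
    <td>01/14/2011
    </td>
    <td>
    </td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"</a></td>
    <td>01/10/2011
    </td>
    <td>
    </td>
  </tr>
</tbody></table>
"""


def table_to_dicts(markup, marker="Website"):
    """Hypothetical helper: one dict per data row of each matching table."""
    doc = html.fromstring(markup)
    records = []
    # descendant-or-self also matches when fromstring returns the table itself
    for table in doc.xpath("descendant-or-self::table"):
        rows = table.xpath(".//tr")
        if not rows:
            continue
        header = [td.text_content().strip() for td in rows[0].xpath("./td")]
        if not header or header[0] != marker:
            continue  # not the table we are after
        for tr in rows[1:]:
            cells = tr.xpath("./td")
            if len(cells) != len(header):
                continue  # skip rows that do not line up with the header
            values = [str(td.text_content()).strip() for td in cells]
            hrefs = cells[0].xpath(".//a/@href")
            if hrefs:
                # the URL is in the href attribute, not in the cell text
                values[0] = str(hrefs[0])
            records.append(dict(zip(header, values)))
    return records


if __name__ == "__main__":
    for record in table_to_dicts(SAMPLE_HTML):
        print(record)
```

Using `.//tr` rather than `/tbody/tr` is a deliberate choice here: it keeps the sketch working whether or not the page actually contains a `tbody` element.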

2011-01-21 18:01:18 virhilo

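Thanks for the lxml implementation! I haven't checked it yet because lxml isn't installed on our machine. An admin has to do that, so I'm waiting :( Can we convert this code to BeautifulSoup for a quick check? – ThinkCode 2011-01-21 18:14:43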
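For the BeautifulSoup conversion asked about here, something along these lines might do for a quick check. This is a sketch only: it assumes the `bs4` package (Beautiful Soup 4) with Python's built-in `html.parser` backend, which is not necessarily the version available to the asker, and `extract_rows` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question, broken <a> tags included.
SAMPLE_HTML = """
<table border="2" width="100%">
  <tbody><tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"</a></td>
    <td>01/14/2011
    </td>
    <td>
    </td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"</a></td>
    <td>01/10/2011
    </td>
    <td>
    </td>
  </tr>
</tbody></table>
"""


def extract_rows(markup, marker="Website"):
    """Hypothetical helper: (url, last_visited, last_loaded) per data row."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for table in soup.find_all("table"):
        first_cell = table.find("td")
        if first_cell is None or first_cell.get_text(strip=True) != marker:
            continue  # the page holds several tables; keep only the matching one
        for tr in table.find_all("tr")[1:]:  # skip the header row
            cells = tr.find_all("td")
            if len(cells) < 3:
                continue
            link = cells[0].find("a")
            url = link.get("href", "") if link is not None else ""
            results.append((
                url,
                cells[1].get_text(strip=True),
                cells[2].get_text(strip=True),
            ))
    return results


if __name__ == "__main__":
    for url, visited, loaded in extract_rows(SAMPLE_HTML):
        print(url, visited, loaded)
```

Because `html.parser` ships with Python, this variant needs no lxml install; passing `"lxml"` as the second argument to `BeautifulSoup` would be the switch once the admin has installed it.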

Here is a version that uses ElementTree and the limited XPath it provides:

from xml.etree.ElementTree import ElementTree

# ElementTree is an XML parser, so table.html must be well-formed XML.
doc = ElementTree().parse('table.html')

for t in doc.findall('.//table'):
    # there may be multiple tables, check we have the right one
    first = t.find('./tbody/tr/td')
    if first is not None and first.text == 'Website':
        for tr in t.findall('./tbody/tr')[1:]:  # skip the header row
            tds = tr.findall('./td')
            print(tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip())

Result:

http://google.com 01/14/2011 
http://stackoverflow.com 01/10/2011

2011-01-21 20:24:17 Spaceghost
