2011-08-06 138 views

Parsing HTML tables: to start, here is my complete current code.

import urllib 
from BeautifulSoup import BeautifulSoup 
import sgmllib 
import re 

page = ('http://www.sec.gov/Archives/edgar/data/'
        '8177/000114036111018563/form10k.htm')

sock = urllib.urlopen(page) 
raw = sock.read() 
soup = BeautifulSoup(raw) 

tablelist = soup.findAll('table') 

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0 
trlist = [] 
dtablelist = [] 

while counter < len(tablelist): 
    trsegment = tablelist[counter].findAll('tr') 
    trlist.append(trsegment) 
    strsegment = str(trsegment) 
    myparser = MyParser() 
    myparser.parse(strsegment) 
    sub = myparser.get_descriptions() 
    dtablelist.append(sub) 
    counter = counter + 1 

ex = [] 

dtablelist = [s for s in dtablelist if s != ex] 

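The task I am trying to accomplish is to take all the tables from an HTML document and reprint them into an Excel spreadsheet. When I create trlist, the output looks like this: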

print trlist[1] 
[<tr> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td> 
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div> 
</td> 
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td> 
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td> 
</tr>, <tr> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td> 
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div> 
</td> 
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold"><  <font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div> 
</div> 
</td> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td> 
</tr>,... 

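As you can see, each item of trlist is an individual row of the table (<tr></tr>), which is what I want. However, when I run each trlist item through my sgmllib parser to retrieve the content between the <td></td> tags, I get this output: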

print dtablelist[1] 
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n'] 

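As you can see, the output has each piece of content as its own individual string, rather than as a list of the contents of each table row (<tr></tr>). So basically the output I want is: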

[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']] 

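Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know a workaround that would let me parse a list within a list (a.k.a. Inception-style)?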


Why are you using two different parsers instead of using BeautifulSoup for the whole thing? (And why are you importing BeautifulSoup twice?) – kindall


Importing BeautifulSoup twice was a mistake. Also, I am using sgmllib to parse because when I do this: trsegment = tablelist[counter].findAll('tr'), it returns list-type output rather than Tag or BeautifulSoup-type output. – kr21

Answers

2

Use lxml.html:

>>> import lxml.html 
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"] 
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data] 
[['test', 'help'], ['data1', 'data2']] 

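Here is some more complete code. It stores the text in a list of tables, where each table is a list of rows (tr) and each row is a list of all the text it contains.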

import urllib 
import lxml.html 

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read() 
tree = lxml.html.fromstring(data) 

tables = [] 
for tbl in tree.iterfind('.//table'): 
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables 
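
Since the stated goal is to get the tables into Excel, a nested list shaped like `tables` can be written out as CSV, which Excel opens directly. The following is only a minimal sketch, assuming Python 3 and the standard-library csv module; `tables_to_csv` and the sample data are hypothetical names made up for illustration and are not part of the code above.

```python
import csv
import io


def tables_to_csv(tables, out):
    """Write every row of every table to `out`, with a blank row between tables."""
    writer = csv.writer(out)
    for index, rows in enumerate(tables):
        if index:
            writer.writerow([])  # blank separator row before each later table
        writer.writerows(rows)


# Hypothetical sample data in the same shape as the nested list built above:
# a list of tables, each a list of rows, each a list of cell strings.
sample = [
    [["Title of each class", "Name of exchange"],
     ["Common Stock, par value", "NASDAQ Global Market"]],
    [["$1.00 per share"]],
]

buffer = io.StringIO()
tables_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, pass a file opened with `open("tables.csv", "w", newline="")`; the file name here is also just an example. Cells containing commas are quoted automatically by the csv module.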

Hope this helps, cheers!


Yes, this is exactly what I was looking for, thank you very much! – kr21

1

In case anyone is searching for a solution to the same problem but is using Python 3:

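You don't have to use an external library, even if you are on Python 3. For parsing HTML tables, the SGMLParser class has been replaced by HTMLParser in html.parser. I have written a simple class derived from HTMLParser; it is here in a GitHub repo. It only keeps track of the scope of the current <td>, <tr>, and <table> tags. Compared with using etree, its advantage is that it runs correctly on HTML that does not conform to the XML specification, and it needs no external library.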

You can use the class (named HTMLTableParser here) as follows:

import urllib.request 
from html_table_parser import HTMLTableParser 

target = 'http://www.twitter.com' 

# get website content 
req = urllib.request.Request(url=target) 
f = urllib.request.urlopen(req) 
xhtml = f.read().decode('utf-8') 

# instantiate the parser and feed it 
p = HTMLTableParser() 
p.feed(xhtml) 
print(p.tables) 

The output is a list of 2D lists, each representing one table. It might look like this:

[[[' ', ' Anmelden ']], 
[['Land', 'Code', 'Für Kunden von'], 
    ['Vereinigte Staaten', '40404', '(beliebig)'], 
    ['Kanada', '21212', '(beliebig)'], 
    ... 
    ['3424486444', 'Vodafone'], 
    [' Zeige SMS-Kurzwahlen für andere Länder ']]]
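
The scope-tracking idea described in this answer can be sketched with nothing but the standard library. This is not the code from the linked repo: `SimpleTableParser` and the sample HTML are hypothetical, written only to illustrate how remembering the current table, row, and cell yields one list per row. It assumes Python 3 and, as a simplification, does not handle nested tables.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect cell text as tables -> rows -> cells (illustrative sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []     # finished tables
        self._table = None   # rows of the table currently being read
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text only counts while we are inside a cell; tags such as
        # <div> or <font> inside the cell are simply ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = self._row = self._cell = None


# Hypothetical input resembling the rows shown in the question.
html = (
    "<table>"
    "<tr><td>Title of each class</td><td>Name of exchange</td></tr>"
    "<tr><td><div>Common Stock, par value</div></td>"
    "<td>NASDAQ Global Market</td></tr>"
    "</table>"
)

parser = SimpleTableParser()
parser.feed(html)
parser.close()
print(parser.tables)
```

Because each row gets its own list when its `<tr>` opens, the cells stay grouped per row instead of being flattened into one list, which is the grouping the question asks for.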