Iteratively parsing HTML (with lxml?)

I'm trying to iteratively parse a very large HTML document (I know.. yuck) to reduce the amount of memory used. The problem I'm having is that I get XML syntax errors such as:

lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

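This in turn causes everything to stop.

Is there a way to iteratively parse HTML without choking on syntax errors?

At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. That seems like a pretty disgusting solution. Is there a better way?

Edit:

This is what I'm currently doing: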

import re
from lxml import etree

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while context:
    try:
        event, el = next(context)

        # do something

        # remove old elements
        while el.getprevious() is not None:
            del el.getparent()[0]

    except StopIteration:
        # end of document reached
        break
    except etree.XMLSyntaxError as e:
        print(e.msg)
        # pull the offending line number out of the error message
        lineno = int(re.search(r'line (\d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        # reopen the edited file and restart parsing from the top
        tfile = open(tfilename, 'rb')
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print('oops keyerror')
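
The question never shows remove_line. A minimal sketch of what such a helper might look like, assuming it drops a single 1-based line from the file in place (the name comes from the question; the body and the sample file are guesses for illustration, not the asker's code):

```python
import os
import tempfile


def remove_line(filename, lineno):
    """Rewrite `filename` without its 1-based line number `lineno`."""
    dirname = os.path.dirname(os.path.abspath(filename))
    # Work in binary mode so the remaining bytes are copied untouched.
    with open(filename, "rb") as src, tempfile.NamedTemporaryFile(
        "wb", dir=dirname, delete=False
    ) as dst:
        for current, line in enumerate(src, start=1):
            if current != lineno:
                dst.write(line)
    # Swap the filtered copy into place once both files are closed.
    os.replace(dst.name, filename)


# Tiny demonstration on a throwaway file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "wb") as f:
        f.write(b"<p>ok</p>\n<p a=1 a=2>bad</p>\n<p>fine</p>\n")
    remove_line(path, 2)
    with open(path, "rb") as f:
        remaining = f.read()
```

Note that every call copies the whole file, so on a very large document each syntax error costs a full rewrite plus a parse restart, which is part of why this approach feels so heavy.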

2011-12-12 Acorn

The perfect solution ended up being Python's very own HTMLParser [docs].

This is the (pretty bad) code I ended up using:

from html.parser import HTMLParser


class MyParser(HTMLParser):
    def __init__(self):
        self.finished = False
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.current_row = []
        self.current_cell = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_table:
            # only start collecting once the target table is reached
            if tag == 'table':
                if attrs.get('id') == 'dgResult':
                    self.in_table = True
        else:
            if tag == 'tr':
                self.in_row = True
            elif tag == 'td':
                self.in_cell = True
            elif (tag == 'a') and (len(self.current_row) == 7):
                # the 8th cell holds a link: keep its URL instead of its text
                url = attrs.get('href') or ''
                self.current_cell = url

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.in_table:
                if self.in_row:
                    self.in_row = False
                    print(self.current_row)
                    self.current_row = []
        elif tag == 'td':
            if self.in_table:
                if self.in_cell:
                    self.in_cell = False
                    self.current_row.append(self.current_cell.strip())
                    self.current_cell = ''

        elif (tag == 'table') and self.in_table:
            self.finished = True

    def handle_data(self, data):
        # text of the 8th cell is ignored; its URL was stored in handle_starttag
        if not len(self.current_row) == 7:
            if self.in_cell:
                self.current_cell += data

With that code I could then do:

parser = MyParser() 
for line in myfile: 
    parser.feed(line)
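
Feeding line by line works, but HTMLParser buffers incomplete markup between feed() calls, so fixed-size chunks should work just as well when the file has very long lines. A minimal sketch of that idea; CellTextCollector and feed_in_chunks are hypothetical names made up for this illustration, and the sample markup is invented:

```python
import io
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        # Treat </tr> as closing a cell too, so an unclosed <td> is tolerated.
        if tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append rather than assign.
        if self.in_cell:
            self.cells[-1] += data


def feed_in_chunks(parser, fileobj, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser in fixed-size chunks."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()


# Broken markup: duplicated attribute and a missing </td>.
html = '<table><tr><td class="a" class="b">one</td><td>two</tr></table>'
collector = CellTextCollector()
# A deliberately tiny chunk size forces tags to be split across feed() calls.
feed_in_chunks(collector, io.StringIO(html), chunk_size=7)
```

Only the current chunk and whatever state the handlers keep stay in memory, which is the property the question was after.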

2011-12-13 04:18:23 Acorn

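Try parsing the HTML document with lxml.html:

Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.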

2011-12-12 16:51:47

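I'm trying to parse the document iteratively because of its large size. As far as I know, lxml.html doesn't have an iterparse function. – Acorn

I suggested lxml.html because the OP doesn't mention having tried lxml.html. I think downvoting my answer is wrong. –

Pass True for iterparse's html and huge_tree arguments.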

2011-12-12 17:09:52 Kabie

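I'm currently using 'html=True' and it still raises XML syntax errors. I'll take a look at the 'huge_tree' argument. – Acorn

'huge_tree' doesn't seem to be relevant: "huge_tree: disable security restrictions and support very deep trees". My tree isn't deep, just very long. – Acorn

lxml's etree.iterparse currently supports the keyword argument recover=True, so instead of writing a custom subclass of HTMLParser to cope with broken HTML, you can just pass this argument to iterparse.

To properly parse huge and broken HTML you only need to do the following: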

etree.iterparse(tfile, events=('start', 'end'), html=True, recover=True)
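
For context, a minimal sketch of how that call might sit inside a memory-conscious loop, assuming lxml is installed and is recent enough to accept recover=True. The iter_rows helper and the sample markup are made up for this illustration; the sibling-deletion step is the same trick used in the question:

```python
import io

from lxml import etree


def iter_rows(source):
    """Yield the cell texts of each <tr>, freeing rows already handled."""
    context = etree.iterparse(source, events=("end",), html=True, recover=True)
    for _event, el in context:
        if el.tag != "tr":
            continue
        yield ["".join(td.itertext()).strip() for td in el.iter("td")]
        # Drop the row's content and any earlier siblings to keep the tree small.
        el.clear()
        while el.getprevious() is not None:
            del el.getparent()[0]


# Broken markup: duplicated attribute and a missing </td>.
broken = (
    b"<html><body><table>"
    b'<tr><td class="a" class="b">1</td><td>2</tr>'
    b"<tr><td>3</td><td>4</td></tr>"
    b"</table></body></html>"
)
rows = list(iter_rows(io.BytesIO(broken)))
```

In real use tfile would be the large file opened in binary mode, and the rows would be consumed one at a time rather than collected into a list.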

2015-08-17 11:52:16

This is the best answer for me. –
