Parsing a very large HTML file with Python (ElementTree?)
I asked about using BeautifulSoup to parse a very large (270MB) HTML file and getting a memory error, and was pointed toward ElementTree as a solution.
I am trying to use its event-driven parsing, documented here. Testing with a smaller settings file works fine:
>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
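This successfully prints the elements. However, running the same code with 'messages.htm' in place of 'settings.htm', just to see whether it works before I even start the actual coding, gives this result: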
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6
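I'm wondering whether this is because ET is better suited to parsing XML documents? If that is the case, and there is no workaround, then I'm back to square one. Any suggestions on how to parse this file, and how to debug along the way, would be greatly appreciated!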
Try the HTML parser from lxml. – Daniel
[Iteratively parsing HTML (with lxml?)](http://stackoverflow.com/questions/8477627/iteratively-parsing-html-with-lxml) – har07
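
On the XML question: `xml.etree.ElementTree.iterparse` is backed by an XML parser, so it raises `ParseError` as soon as the input stops being well-formed XML, which ordinary HTML often is not (unclosed `<meta>` or `<br>` tags, for example). Besides the lxml route suggested above, one standard-library option is `html.parser.HTMLParser`, which is event-driven and can be fed the file in chunks, so no tree is built in memory. The following is only a minimal sketch: `TagCounter`, `stream_html`, and the `sample` string are hypothetical names made up for illustration, and the sample is a stand-in, not the actual contents of messages.htm.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical handler: counts start tags and text characters, builds no tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tag_counts = {}
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate rather than
        # assume one call per text node.
        self.text_chars += len(data)


def stream_html(fileobj, parser, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser one chunk at a time."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser


# Made-up sample: acceptable HTML, but not well-formed XML (<meta> and <br>
# are never closed).
sample = ('<html><head><meta charset="utf-8"></head>'
          '<body><p>Hello<br>world</p></body></html>')

# The XML parser rejects it while iterating...
try:
    for _event, _element in ET.iterparse(io.BytesIO(sample.encode("utf-8"))):
        pass
    xml_ok = True
except ET.ParseError as exc:
    xml_ok = False
    print("ElementTree failed:", exc)

# ...while the HTML parser streams through it.
counter = stream_html(io.StringIO(sample), TagCounter())
print(counter.tag_counts, counter.text_chars)
```

For a real file, the `io.StringIO(sample)` argument would be replaced by something like `open(path, encoding="utf-8", errors="replace")`; the encoding is an assumption and should match the file. When debugging a `ParseError`, the reported line and column point at the first place the XML parser gave up, so reading the first few hundred characters of the file is usually enough to see which HTML construct it tripped over.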