Python - Tidy HTML parsing
This code takes some bad HTML, cleans it up with the Tidy library, and then passes the result to HtmlLib.Reader().
import tidy
options = dict(output_xhtml=1,
               add_xml_decl=1,
               indent=1,
               tidy_mark=0)
from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()
doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))
It seems I am not passing the right type to fromString, judging by this traceback:
Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found
What should I do differently? Thanks!
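Which 'tidy' module are you importing? PyPI lists at least two, and I'm not sure whether the one included in the 'tidy' source distribution (for Ubuntu's 'tidy' package) is one of them. – intuited 2010-10-15 09:55:55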
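The traceback suggests that tidy.parseString() returns a _Document object rather than a string, while fromString() hands its argument straight to a string buffer. The following is a minimal sketch of that mismatch, not the real libraries: FakeDocument and from_string are hypothetical stand-ins, it is written for Python 3 (io.StringIO instead of cStringIO), and it assumes the tidy document object yields the cleaned markup when passed to str().

```python
import io


class FakeDocument:
    """Hypothetical stand-in for the _Document object named in the traceback."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        # Assumption: str() on the tidy document yields the cleaned markup.
        return self._markup


def from_string(text):
    """Hypothetical stand-in for reader.fromString: wraps its argument in a string buffer."""
    return io.StringIO(text).read()


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the document object itself fails, because the buffer wants a string.
error = None
try:
    from_string(doc)
except TypeError as exc:
    error = str(exc)

# Converting the document to a string first works.
cleaned = from_string(str(doc))
```

If that assumption holds for the tidy module in use, the equivalent change in the question's code would be to wrap the call, as in reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options))).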