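I have an XML file. Please download it and save it as blog.xml. It is the list of my articles on Google's Blogger. I wrote some code to parse it, and something goes wrong with lxml. How do I handle encoding in lxml so that the HTML string is parsed correctly?
Code 1: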
from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode the entry's HTML to UTF-8 bytes before converting it to text
    string = entry.content[0]['value'].encode("utf-8")
    print html2text(string)
Code 1 gives the correct result.
Code 2:
import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # The entry's HTML is passed to lxml as a Unicode string
    string = entry.content[0]['value']
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
Code 2 fails with an error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.
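The ValueError above says that a Unicode string carrying its own encoding declaration was handed to the parser. One way to avoid that particular error, besides passing bytes, is to remove a leading XML declaration before parsing. The following is only a minimal sketch using the standard library: `strip_xml_declaration` and the sample string are hypothetical, it assumes the declaration sits at the very start of the entry content, and it does not address the XMLSyntaxError.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(text):
    """Return text without a leading XML declaration.

    Hypothetical helper for illustration; it is not part of lxml
    or feedparser.
    """
    return _XML_DECL.sub(u'', text, count=1)


# Made-up sample standing in for entry.content[0]['value']
sample = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<html><body><p>caf\xe9</p></body></html>')
cleaned = strip_xml_declaration(sample)
```

The cleaned Unicode string could then be passed to lxml.html.document_fromstring in place of `string` in Code 2.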
Code 3:
import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Same as Code 2, but the HTML is encoded to UTF-8 bytes first
    string = entry.content[0]['value'].encode("utf-8")
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
Code 3 fails with a different error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid
How do I handle encoding in lxml so that the HTML string is parsed correctly?
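I suspect there *are* parse errors in those entries, but lxml is ignoring the exception in the wrong place. Python C-API exception handling requires code to check for exceptions at certain points; if that is not done, the exception suddenly surfaces *later*, when another exception *is* handled correctly. What happens if you omit the first 'test' call? Do you get the same XMLSyntaxError? – 2013-04-16 08:13:39
In any case, this should definitely be reported to the lxml project. – 2013-04-16 08:14:27
@Martijn Pieters: Yes, the same error occurs. The first 'test' call is only there to show that the 'XMLSyntaxError' message changes after 'e' has been parsed. – gatto 2013-04-16 10:20:09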