Why can't Python lxml parse my XML?

I'm using the Python lxml library to parse my XML, but I'm having trouble parsing one particular text. Check out the following code:

>>> print type(raw_text_xml) 
<type 'unicode'> 
>>> from lxml import etree 
>>> article_xml_root = etree.fromstring(raw_text_xml, parser) 
Traceback (most recent call last): 
    File "<input>", line 1, in <module> 
    article_xml_root = etree.fromstring(raw_text_xml, parser) 
    File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121) 
    File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470) 
    File "parser.pxi", line 1667, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101229) 
    File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:96139) 
    File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290) 
    File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476) 
    File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772) 
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

So it says that the first character is not a <, which on inspection turns out to be true:

>>> print raw_text_xml[:20] 
ďťż<?xml version="1.

There are 3 weird characters in front of the XML. To clean these up I tried the following:

>>> article_xml_root = etree.fromstring(raw_text_xml[3:], parser) 
Traceback (most recent call last): 
    File "<input>", line 1, in <module> 
    article_xml_root = etree.fromstring(raw_text_xml[3:], parser) 
    File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121) 
    File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435) 
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

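And now it suddenly complains that it is a unicode string with an encoding declaration, whereas if you look all the way up to my first line of code, it was unicode all along.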

Does anybody know why after slicing it suddenly gives a whole different error? And most importantly, does anybody know how I can solve this?


2016-03-22 kramer65

Please add your XML. –

why after slicing it suddenly gives a whole different error?

Because after slicing the first error is gone, so parsing can continue until it hits the second error.

And most importantly, does anybody know how I can solve this?

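Maybe the error message is right (it happens) and you can fix it by converting the unicode string to bytes. I think that is better than removing the encoding declaration.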

raw_text_xml.encode('utf8')

Or, instead of 'utf8', use whatever encoding is declared in the XML fragment.
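As a minimal sketch of that suggestion, the snippet below passes the same document to lxml first as text and then as bytes. It is written for Python 3 (the question uses Python 2), assumes lxml is installed, and uses a made-up sample document; the names `xml_text` and `root` are only for illustration.

```python
from lxml import etree

# Made-up sample: a text (unicode) string that carries an encoding declaration.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><article><title>caf\u00e9</title></article>'

# Passing the text string directly is what triggers the ValueError in the question.
try:
    etree.fromstring(xml_text)
except ValueError as exc:
    print("text input rejected:", exc)

# Encoding to bytes with the declared encoding lets the parser decode it itself.
root = etree.fromstring(xml_text.encode("utf-8"))
print(root.findtext("title"))
```

The bytes must be produced with the same encoding that the declaration names, otherwise the parser will decode them incorrectly.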


2016-03-22 14:58:28 Goyo

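Sounds legit. Do you have a suggestion for converting it to bytes? – kramer65

I tried 'etree.fromstring(bytearray(raw_text_xml[3:]), parser)', but that gives me a 'TypeError: unicode argument without an encoding'. Any ideas? – kramer65

See my edit. 'bytearray' is a different thing. In Python 2, 'bytes' is an alias for 'str'. – Goyo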
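The difference raised in these comments can be sketched as follows. This is Python 3, where the equivalent error message reads "string argument without an encoding"; the sample text and variable names are assumptions for illustration only.

```python
text = '<?xml version="1.0" encoding="UTF-8"?><root/>'

# bytearray() refuses a text string unless it is told which encoding to use.
try:
    bytearray(text)
    failed = False
except TypeError as exc:
    failed = True
    print("bytearray(text) failed:", exc)

# Either encode the string explicitly, or give bytearray the encoding.
payload = text.encode("utf-8")
also_ok = bytearray(text, "utf-8")
```

Both `payload` and `also_ok` hold the same bytes; `str.encode` is the more direct way to get input suitable for a parser.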

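The first error is caused by the wrong characters at the start. Once you fix that, you run into the second one, which is that your raw_text_xml is unicode.

You may know what the appropriate encoding is (ASCII, Latin-1, UTF-8, ...). I cannot tell without seeing the actual content.

Assuming the encoding is stored in a variable named encoding, you should be able to do: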

article_xml_root = etree.fromstring(raw_text_xml.encode(encoding), parser)

(But I strongly advise you to check the content first by displaying print raw_text_xml[3:160] ...)
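If the encoding is not known in advance, one option is to read it from the XML declaration itself. The helper below is a hypothetical sketch in Python 3 using only the standard library; `declared_encoding` is not part of lxml, and the sample document is made up.

```python
import re

# Hypothetical helper, not an lxml API: capture the encoding named in the
# XML declaration at the very start of the text.
_DECL = re.compile(
    r'<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(xml_text, default="utf-8"):
    """Return the declared encoding, or the default when none is declared."""
    match = _DECL.match(xml_text)
    return match.group(1) if match else default

# Made-up sample declaring ISO-8859-2 and containing one non-ASCII character.
sample = '<?xml version="1.0" encoding="iso-8859-2"?><a>\u0142</a>'
encoding = declared_encoding(sample)

# Encode with the declared encoding so the bytes agree with the declaration.
payload = sample.encode(encoding)
```

The resulting `payload` could then be passed to `etree.fromstring` in place of the unicode string. UTF-8 is used as the fallback because it is the default for XML documents that declare no encoding.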


2016-03-22 15:30:57

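Wherever you decoded the original data into Unicode, it was done incorrectly. It looks like it was decoded as iso-8859-2, while it was originally UTF-8 with a BOM signature. The following undoes the incorrect decode and decodes it again correctly: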

>>> s.encode('iso-8859-2').decode('utf-8-sig') 
'<?xml version="1.'
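To illustrate where the three stray characters come from, here is a small sketch in Python 3 using only the standard library. It starts from made-up UTF-8 bytes with a BOM, decodes them with the wrong codec to reproduce the symptom, and then reverses the mistake as described above. All variable names are assumptions for illustration.

```python
import codecs

# Made-up original file content: UTF-8 bytes preceded by the UTF-8 BOM.
document = '<?xml version="1.0" encoding="UTF-8"?><a>caf\u00e9</a>'
original_bytes = codecs.BOM_UTF8 + document.encode("utf-8")

# Decoding with the wrong codec turns the 3 BOM bytes into 3 visible characters.
garbled = original_bytes.decode("iso-8859-2")
print(garbled[:8])

# Undo the wrong decode, then decode as UTF-8 while dropping the BOM.
repaired = garbled.encode("iso-8859-2").decode("utf-8-sig")

# Bytes suitable for a parser, matching the encoding declaration.
xml_bytes = repaired.encode("utf-8")
```

Slicing off the first three characters hides the symptom only for the BOM; any non-ASCII character later in the document would remain garbled, which is why reversing the decode is the safer fix. Better still is to decode the source bytes as 'utf-8-sig' in the first place, or to hand the raw bytes to the parser.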


2016-03-22 20:21:37
