2013-07-10 54 views
2

Python ElementTree decoding error

I have an ElementTree instance that I am trying to output as text using the tostring method:

tostring(root, encoding='UTF-8') 

I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes contains the character u'\u2014'. I set the text attribute as follows:

my_str = u'\u2014' 
el.text = my_str.encode('UTF-8') 

How can I successfully serialize the tree to text? Am I encoding the node incorrectly? Thanks.

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "crisis_app/converters/to_xml.py", line 129, in convert 
    return tostring(root, encoding='UTF-8') 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring 
    ElementTree(element).write(file, encoding, method=method) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write 
    serialize(write, self._root, encoding, qnames, namespaces) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml 
    _serialize_xml(write, e, encoding, qnames, None) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml 
    _serialize_xml(write, e, encoding, qnames, None) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml 
    _serialize_xml(write, e, encoding, qnames, None) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml 
    write(_escape_cdata(text, encoding)) 
    File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata 
    return text.encode(encoding, "xmlcharrefreplace") 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128) 
+2

The message says it is trying to decode the text as ASCII, not UTF-8. Also, 0xE2 does not seem to have anything to do with 0x2014 (em-dash).

+1

Can we see more of the code? It looks like you have non-Unicode text in your tree, which makes 'text.encode()' first **decode** it to Unicode and only then encode it.

+0

@JimGarrison Yes, it is related. The UTF-8 representation of the em-dash is 0xE2 0x80 0x94 (e28094), and 0xE2 is its first byte. http://www.fileformat.info/info/unicode/char/2014/index.htm
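The byte values quoted in this comment are easy to check. The snippet below is only a quick sketch and assumes Python 3 syntax, whereas the question itself runs on Python 2.7:

```python
# U+2014 (em dash) and its UTF-8 encoding.
em_dash = '\u2014'
utf8_bytes = em_dash.encode('utf-8')

print(utf8_bytes.hex())    # e28094
print(hex(utf8_bytes[0]))  # 0xe2, the byte named in the traceback

# Decoding the same bytes as UTF-8 gives the original character back.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == em_dash)  # True
```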

Answers

2

If you do this:

my_str = u'\u2014' 
el.text = my_str.encode('UTF-8') 

you are setting the text to the UTF-8 encoded version of the Unicode character. It is the same as

el.text = '\xe2\x80\x94' 

Now you no longer have a Unicode character, but a sequence of bytes.

If you then do:

tostring(root, encoding='UTF-8') 

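you are saying that you want the content encoded as UTF-8. To do that, the byte string is first decoded to unicode internally, using the default encoding (ascii), and only then encoded to utf-8. The decode step fails, of course, because the bytes in the string are outside the ascii range.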
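The hidden decode step can be reproduced explicitly. This is a minimal sketch that assumes Python 3, where the decode has to be written out by hand; in Python 2.7 it happens implicitly when `.encode()` is called on a byte string. The variable names are illustrative only.

```python
# The bytes that ended up in el.text after my_str.encode('UTF-8').
raw = '\u2014'.encode('utf-8')

try:
    # Python 2 performs this decode implicitly with the default codec.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding with the codec the bytes were actually written in succeeds.
decoded = raw.decode('utf-8')
```

The printed message names the same codec and the same byte (0xe2) as the traceback in the question; only the position differs, because here the character is at the start of the string.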

ElementTree is perfectly capable of working with Unicode, so just give it unicode objects instead of str:

>>> from xml.etree import ElementTree as et 
>>> e = et.Element('test') 
>>> e.text = u'\u2014' 

>>> s = et.tostring(e) 
>>> print s, repr(s) 
<test>&#8212;</test> '<test>&#8212;</test>' 

>>> s = et.tostring(e, encoding='utf-8') 
>>> print s, repr(s) 
<test>—</test> '<test>\xe2\x80\x94</test>' 
+0

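Yeah, it turns out the problem was that I was calling 'el.text = str(content)' in all cases to guard against 'content' being an int. That was throwing errors, so my fix for it had a logic error that ended up double-encoding the output. – aaronstacy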
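One way to guard against non-string content without encoding anything twice is to normalize every value to text before assigning it. The helper below is a hypothetical sketch, not code from the thread: the name `to_text` and the sample values are assumptions, and it is written for Python 3. In Python 2.7 the same idea would test for `str` and return `unicode`.

```python
from xml.etree import ElementTree as ET


def to_text(content, encoding='utf-8'):
    """Return content as a text string without encoding it twice."""
    if isinstance(content, bytes):
        # Already-encoded input: decode it once instead of re-encoding.
        return content.decode(encoding)
    # Ints, floats and text strings all convert safely with str().
    return str(content)


root = ET.Element('root')
for value in (42, '\u2014', b'\xe2\x80\x94'):
    child = ET.SubElement(root, 'item')
    child.text = to_text(value)

# Encoding happens exactly once, at serialization time.
as_text = ET.tostring(root, encoding='unicode')
as_bytes = ET.tostring(root, encoding='utf-8')
```

The element text stays as text for the whole lifetime of the tree, so `tostring` is the only place where an encoding is applied.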