2012-04-06 20 views
9

Writing a UTF-8 XML file with ElementTree

I want to write an XML file containing UTF-8 encoded data using ElementTree, like this:

#!/usr/bin/python                  
# -*- coding: utf-8 -*-                 

import xml.etree.ElementTree as ET 
import codecs 

testtag = ET.Element('unicodetag') 
testtag.text = u'Töreboda' #The o is really ö (o with two dots over). No idea why SO dont display this 
expfile = codecs.open('testunicode.xml',"w","utf-8-sig") 
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True) 
expfile.close() 

This fails with the following error:

Traceback (most recent call last): 
    File "unicodetest.py", line 10, in <module> 
    ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True) 
    File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write 
    serialize(write, self._root, encoding, qnames, namespaces)  
    File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml 
    write(_escape_cdata(text, encoding)) 
    File "/usr/lib/python2.7/codecs.py", line 691, in write 
    return self.writer.write(data) 
    File "/usr/lib/python2.7/codecs.py", line 351, in write 
    data, consumed = self.encode(object, self.errors) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128) 

Using the "us-ascii" encoding instead works, but it does not preserve the Unicode characters in the data. What is happening?

Answers

17

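A file object returned by codecs.open expects Unicode strings to be written to it, and it takes care of encoding them to UTF-8. ElementTree's write, however, encodes Unicode strings to UTF-8 byte strings before sending them to the file object. Because the file object wants Unicode strings, it coerces the byte strings back to Unicode using the default ascii codec, and that coercion raises the UnicodeDecodeError.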
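The coercion step can be reproduced in isolation. The following is only a minimal sketch, written for Python 3 (an assumption; the question uses Python 2.7, where the decode happened implicitly inside the codecs stream writer). The variable names are made up for illustration.

```python
# Sketch of the failing step, assuming Python 3.
# In Python 2 the ascii decode below was done implicitly by the
# codecs stream writer when it received a byte string.

text = "T\u00f6reboda"           # 'Töreboda' as a Unicode string
encoded = text.encode("utf-8")   # what ElementTree passes to the file object
# 'ö' is now the two bytes 0xc3 0xb6, so byte 0xc3 sits at position 1.

try:
    # What the codecs-wrapped file effectively did before re-encoding.
    encoded.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message mentions byte 0xc3 at position 1, the same byte and position reported in the traceback from the question.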

Just do this instead:

#expfile = codecs.open('testunicode.xml',"w","utf-8-sig") 
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True) 
#expfile.close() 
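If a file object is still wanted rather than a file name, one option is to open it in binary mode so that ElementTree is the only layer doing the encoding. This is a sketch for Python 3 (an assumption, the question targets Python 2.7), and the temporary output path is hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

testtag = ET.Element("unicodetag")
testtag.text = "T\u00f6reboda"   # 'Töreboda'

# Hypothetical output location, used only for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), "testunicode.xml")

# Binary mode: the file accepts bytes, and ElementTree performs the
# single Unicode -> UTF-8 encoding step itself.
with open(out_path, "wb") as expfile:
    ET.ElementTree(testtag).write(expfile, encoding="UTF-8", xml_declaration=True)

# Read the raw bytes back to confirm that 'ö' was stored as 0xc3 0xb6.
with open(out_path, "rb") as f:
    raw = f.read()

print(raw.decode("utf-8"))
```

Parsing the file again with ET.parse(out_path) should give back the original text, which is a quick way to check that nothing was encoded twice.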
+2

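+1. Just to clarify: the problem is that you are trying to encode unicode -> utf-8 twice. ElementTree does it once, and then the codecs-enabled stream tries to do it again. Because its input is already encoded, the second step gets confused (it expects a unicode string and instead gets a utf-8 encoded byte string). – 2012-04-06 20:13:06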

+0

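And here I was thinking I was helping by providing a unicode file... Can I just say that I love stackoverflow? A perfect answer within 3 hours! The elaboration in the comment above also explains a lot. – c0m4 2012-04-06 20:57:05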

+1

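I have been working with utf-8 data and got similar errors in ElementTree._serialize_text() or _serialize_xml() when trying to write an XML file. I was able to fix this by converting the strings to unicode with myString.decode('utf-8') before adding them to my ET.Element objects. It seems ET.ElementTree.write() is not happy with other string encodings. – drevicko 2012-07-17 14:49:21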
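The decode-first approach from this comment might look roughly like the following. It is a sketch for Python 3 (an assumption; the comment describes Python 2 byte strings), and raw_bytes stands in for data that arrived already UTF-8 encoded.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: text that arrived as UTF-8 encoded bytes.
raw_bytes = b"T\xc3\xb6reboda"

elem = ET.Element("unicodetag")
# Decode to a Unicode string before handing the value to ElementTree,
# so that ElementTree only ever sees text and encodes it once on output.
elem.text = raw_bytes.decode("utf-8")

# encoding="unicode" returns a str, which makes the result easy to inspect.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
```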