How do I correctly parse UTF-8 XML with ElementTree?

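I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the errors below. How do I correctly parse UTF-8 XML with ElementTree?

* My test XML file contains Arabic characters.

Task: open and parse the file utf8_file.xml.

My first attempt: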

import codecs
import xml.etree.ElementTree as etree

# codecs.open() decodes the file to unicode before the parser sees it
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second attempt:

import codecs
import xml.etree.ElementTree as etree

with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    # tostring() is handed the file object here, not a parsed element
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on possible solutions.

2014-02-11 minerals

Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree

# Open in binary mode so the parser receives raw, undecoded bytes
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

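The XML file must contain enough information in its first line for the parser to handle decoding. If the header is missing, the parser must assume that UTF-8 is used.

Because it is the XML header that holds this information, it is the parser's responsibility to do all of the decoding.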
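To illustrate the point about the header, here is a minimal sketch for Python 3. It feeds the parser raw bytes and lets it pick the encoding. The two sample documents and the variable names are made up for illustration and are built in memory rather than read from utf8_file.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, built in memory for illustration.
# The first one declares its encoding in the XML header.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>café</word>'
).encode('iso-8859-1')

# The second one has no header at all, so the parser assumes UTF-8.
utf8_doc = '<word>مرحبا</word>'.encode('utf-8')

# In both cases the parser receives bytes and does the decoding itself.
latin1_root = ET.fromstring(latin1_doc)
utf8_root = ET.fromstring(utf8_doc)

print(latin1_root.text == 'café')
print(utf8_root.text == 'مرحبا')
```

Neither document is decoded by the calling code; the only hint the parser gets is the header, or its absence.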

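Your first attempt failed because Python tried to *encode* the already-decoded Unicode values again, using the default ASCII codec, so that the parser could handle byte strings as it expects. Your second attempt failed because etree.tostring() expects a parsed tree as its first argument, not a file object.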
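For comparison, here is a rough sketch of how etree.tostring() and etree.fromstring() are meant to be paired on Python 3: tostring() serializes an element, and fromstring() parses the resulting bytes back. The element name and text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Build a small element in memory (hypothetical content).
root = ET.Element('greeting')
root.text = 'مرحبا'

# tostring() takes an Element, not a file object, and with
# encoding='utf-8' it returns UTF-8 encoded bytes.
xml_bytes = ET.tostring(root, encoding='utf-8', method='xml')

# fromstring() parses those bytes back into an Element.
reparsed = ET.fromstring(xml_bytes)

print(type(xml_bytes).__name__)
print(reparsed.text == root.text)
```

To get an element from a file in the first place, call etree.parse() on it and then getroot() on the returned tree.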


2014-02-11 09:41:03

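Excellent, this turned out to be easier than I thought. Even a 'utf-8' file 'without BOM' is parsed correctly. – minerals

UTF-8 without a BOM is the standard; UTF-8 *with* a BOM is mostly Microsoft wanting to make it easier to auto-detect UTF-8 as opposed to other 8-bit encodings. –

'etree.parse(a_file)' handles Unicode by default. However, 'etree.fromstring(a_string)' cannot parse Unicode strings until Python 3.x (see http://bugs.python.org/issue11033), so you have to encode the string manually, as in 'etree.fromstring(a_string.encode('utf-8'))'. –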
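The workaround from the last comment can be sketched as follows. The sample string is hypothetical. On Python 3, fromstring() also accepts the str directly, so encoding first is only needed for the older behaviour the comment describes, but it is harmless there as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML held as a text (Unicode) string.
a_string = '<root><name>مرحبا</name></root>'

# Encode to UTF-8 bytes first, then let the parser decode them.
root = ET.fromstring(a_string.encode('utf-8'))
name = root.find('name').text

print(name == 'مرحبا')
```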
