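How to write an ElementTree with UTF-8

I've generated an enormous (50MB) XML ElementTree, and somewhere in the original data there are some UTF-8 characters that didn't get stripped. ElementTree.write and .tostring appear to choke on unicode, even though tostring has an "encoding='UTF-8'" option. The documentation is pretty limited, and from looking at the source code I'm not even sure that tostring is UTF-8 friendly.

So my question: how can I strip the non-ASCII characters out of this entire element tree so I can just write this monster (which took 8 hours to generate) to disk? I've pickled it for now. I also ran a function called latin1_to_ascii over most of the data: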
    def latin1_to_ascii(unicrap):
        """
        This takes a UNICODE string and replaces Latin-1 characters with
        something equivalent in 7-bit ASCII. Anything not converted is deleted.
        #the unicode hammer approach: http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
        """
        # Maps Latin-1 code points to their ASCII replacements.
        xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
            0xc6:'Ae', 0xc7:'C',
            0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
            0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
            0xd0:'Th', 0xd1:'N',
            0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
            0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
            0xdd:'Y', 0xde:'th', 0xdf:'ss',
            0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
            0xe6:'ae', 0xe7:'c',
            0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
            0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
            0xf0:'th', 0xf1:'n',
            0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
            0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
            0xfd:'y', 0xfe:'th', 0xff:'y',
            0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
            0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
            0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
            0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
            0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
            0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
            0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
            0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
            # 0x92 is a C1 control code in Latin-1 (a right single quote in
            # Windows-1252); the mapping to 'a' is kept as posted.
            0xd7:'*', 0xf7:'/', 0x92:'a'
        }
        pieces = []
        for ch in unicrap:
            code = ord(ch)
            if code in xlate:  # `in` replaces dict.has_key(), removed in Python 3
                pieces.append(xlate[code])
            elif code < 0x80:
                pieces.append(str(ch))
            # Unmapped characters at or above 0x80 are dropped.
        return ''.join(pieces)
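That said, this "nuclear option" function only works on strings, and now that I have an Element object I can't seem to strip out what I missed.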
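For illustration, here is a minimal sketch of what applying a string cleaner to a whole tree might look like, assuming Python 3 and the standard xml.etree.ElementTree module. The names `to_ascii` and `strip_tree` are hypothetical, and `to_ascii` is only a stand-in that drops non-ASCII characters; a function such as latin1_to_ascii could be passed as `clean` instead. Tag and attribute names are left untouched.

```python
import io
import xml.etree.ElementTree as ET


def to_ascii(text):
    """Stand-in cleaner (an assumption): drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")


def strip_tree(root, clean=to_ascii):
    """Apply `clean` to the text, tail and attribute values of every element."""
    for elem in root.iter():  # iter() yields root and all of its descendants
        if elem.text:
            elem.text = clean(elem.text)
        if elem.tail:
            elem.tail = clean(elem.tail)
        for key, value in list(elem.attrib.items()):
            elem.set(key, clean(value))
    return root


# Tiny demonstration tree standing in for the real 50MB one.
root = ET.fromstring("<doc name='caf\u00e9'><item>na\u00efve</item>tail \u00fc</doc>")
strip_tree(root)

# Serialize to bytes; with a real file, pass a path or a file opened in "wb" mode.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()
```

Because the walk mutates elements in place, it can be run on a tree that has been unpickled, without regenerating the data.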
8 hours? Are you using 'xml.etree.ElementTree' or 'xml.etree.cElementTree'? That could be a very efficient keystroke...