Ignoring Unicode in XML with Python and lxml?

I am looking to either remove or ignore the Unicode characters in my XML. I am willing to change them somehow during output processing.

My Python:

import urllib2, os, zipfile 
from lxml import etree 

doc = etree.XML(item) 
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()')) 
target = doc.xpath('//references-cited/citation/nplcit/*/text()') 
#target = '-'.join(target).replace('\n-','') 
print "docID: {0}\nCitation: {1}\n".format(docID,target) 
outFile.write(str(docID) +"|"+ str(target) +"\n")

The output this creates:

docID: US-D0607176-S1-20100105 
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www

However, if I add the '-'.join(target).replace('\n-','') line back in, I get this error from both print and outFile.write:

Traceback (most recent call last): 
    File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module> 
    print "docID: {0}\nCitation: {1}\n".format(docID,target) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

How can I ignore the Unicode so that I can output target with outFile.write?


2012-03-12 Hola Sir

What happens when you import unicode_literals from __future__? – 2012-03-12 21:20:38

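You get this error because you have a string containing Unicode characters and you are trying to output it using the ASCII character set. When you print the list, you get the repr of the list and of the strings inside it, which escapes those characters and so avoids the problem.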
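The failure can be reproduced without lxml. Here is a minimal sketch, assuming a sample string copied from the output above (the variable names are illustrative only); it should run on both Python 2.7 and Python 3:

```python
# Sample citation text containing curly quotes (U+201C and U+201D).
text = u"\u201cThe birth of Lee Min Ho's donuts.\u201d"

# Encoding with the ASCII codec is roughly what Python 2 does implicitly
# when a unicode object is passed to str() or formatted into a byte string.
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```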

You need to either encode to a different character set (for example UTF-8), or strip or replace the invalid characters while encoding.
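As a rough illustration of those options, the sketch below applies the standard error handlers of encode to a sample string. The sample text and variable names are assumptions for the example, not part of the original script.

```python
text = u"\u201cdonuts\u201d"

# Keep every character by using an encoding that can represent it.
utf8_bytes = text.encode("utf-8")

# Stay in ASCII and choose what happens to characters that do not fit.
ignored = text.encode("ascii", "ignore")              # drop them
replaced = text.encode("ascii", "replace")            # substitute '?'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # numeric XML references

for label, value in [("utf-8", utf8_bytes), ("ignore", ignored),
                     ("replace", replaced), ("xmlcharrefreplace", xml_refs)]:
    print("{0}: {1!r}".format(label, value))
```

Which handler fits depends on who reads the output: xmlcharrefreplace keeps the information recoverable, while ignore and replace lose it.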

I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", followed by the relevant chapters on encoding and decoding strings in the Python docs.

Here is a small hint to get you started:

print "docID: {0}\nCitation: {1}\n".format(docID.encode("UTF-8"), 
               target.encode("UTF-8"))
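For the outFile.write part of the question, one alternative is to let the file object do the encoding. This is a minimal sketch, assuming the output file is opened with io.open and an explicit UTF-8 encoding; write_record, the sample values, and the temporary path are hypothetical and not from the original script:

```python
import io
import os
import tempfile


def write_record(path, doc_id, citation):
    """Append one 'docID|citation' line to path, encoded as UTF-8."""
    # io.open encodes text on write, so no explicit .encode() call is needed.
    with io.open(path, "a", encoding="utf-8") as out_file:
        out_file.write(u"{0}|{1}\n".format(doc_id, citation))


doc_id = u"US-D0607176-S1-20100105"
citation = u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009."

path = os.path.join(tempfile.mkdtemp(), "citations.txt")
write_record(path, doc_id, citation)

# Read the file back to confirm the curly quotes survived the round trip.
with io.open(path, "r", encoding="utf-8") as in_file:
    round_trip = in_file.read()
```

Keeping everything as unicode inside the program and encoding only at the file boundary avoids scattering .encode() calls through the code.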


2012-03-12 21:38:42 Epcylon

Yes - this worked once I simply added .encode("UTF-8") to the print and write output code. Thank you very much! – 2012-03-12 21:52:40

print "docID: {0}\nCitation: {1}\n".format(docID.encode("utf-8"), target.encode("utf-8"))

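Every character outside the ASCII set is encoded as a multi-byte UTF-8 sequence, which shows up as hex escapes in a repr or on a console that does not decode UTF-8: for example, u"\u201c" becomes "\xe2\x80\x9c". If that is not acceptable, you can do: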

docID = "".join([a if ord(a) < 128 else '.' for a in docID])

This replaces every non-ASCII character with a '.'.
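If ASCII-only output is required, a slightly gentler variant of that replacement maps common typographic quotes to their ASCII look-alikes before falling back to a placeholder. This is only a sketch: to_ascii and the PUNCTUATION table are hypothetical helpers, and the table covers just a few characters.

```python
# Hypothetical mapping from typographic quotes to ASCII look-alikes.
PUNCTUATION = {
    u"\u2018": u"'",
    u"\u2019": u"'",
    u"\u201c": u'"',
    u"\u201d": u'"',
}


def to_ascii(text, placeholder=u"."):
    """Map known punctuation to ASCII and replace other non-ASCII characters."""
    chars = []
    for ch in text:
        if ord(ch) < 128:
            chars.append(ch)
        else:
            chars.append(PUNCTUATION.get(ch, placeholder))
    return u"".join(chars)


cleaned = to_ascii(u"\u201cJazzholic\u201d caf\u00e9")
print(cleaned)
```

Anything not listed in the table, such as the accented letter in the sample, still becomes the placeholder.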


2012-03-12 21:39:31
