Python can't encode bad Unicode to ASCII

I have some Python code that receives a string containing bad unicode. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here is how to reproduce it:

s = 'ad\xc2-ven\xc2-ture' 
s.encode('utf8', 'ignore')

It throws:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

2011-05-25 Eric Palakovich Carr

Are you sure you don't want s.decode('utf8', 'ignore')? – Dan 2011-05-25 13:08:51

Yes, you're right. Oops :) – 2011-05-25 13:24:37

In Python 2.x, converting a byte string to a unicode instance is done with str.decode():

>>> s.decode("ascii", "ignore") 
u'ad-ven-ture'
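
For readers on Python 3, where the byte string type is bytes rather than str, the same call can be sketched roughly as follows. The variable names are mine, not the OP's, and the 'replace' handler is included only to show where the bad bytes sat.

```python
# Python 3 sketch: the byte string from the question as a bytes literal.
raw = b'ad\xc2-ven\xc2-ture'

# 'ignore' silently drops every byte that is not valid ASCII.
ignored = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode('ascii', 'replace')

print(ignored)         # ad-ven-ture
print(repr(replaced))  # shows the U+FFFD markers where 0xc2 used to be
```

Using 'replace' while debugging and 'ignore' only once you know what is being thrown away is one reasonable way to avoid losing data silently.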

2011-05-25 13:09:40

Note that with the OP's encoding (utf-8) instead of ASCII, you get u'adventure'. I prefer unicode(utf8_string, 'utf-8', 'ignore') because it makes it more obvious that a unicode string is being created. – 2011-05-25 14:56:13

There is also s.decode('ascii', 'replace'), which is useful for seeing where the problems are. – Wernight 2012-10-24 13:51:28

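You are confusing "unicode" with "UTF-8". Your string s is not unicode; it is a byte string in some particular encoding (and not UTF-8; more likely iso-8859-1 or similar). Going from a byte string to unicode means decoding the data, not encoding it. Going from unicode to a byte string is encoding. Perhaps you meant to make s a unicode string: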

>>> s = u'ad\xc2-ven\xc2-ture' 
>>> s.encode('utf8', 'ignore') 
'ad\xc3\x82-ven\xc3\x82-ture'

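Or perhaps you want to treat the byte string as UTF-8 but ignore invalid sequences, in which case you would decode the byte string with 'ignore' as the error handler: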

>>> s = 'ad\xc2-ven\xc2-ture' 
>>> u = s.decode('utf-8', 'ignore') 
>>> u 
u'adventure' 
>>> u.encode('utf-8') 
'adventure'
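
The direction rule above (bytes are decoded to text, text is encoded to bytes) can be sketched in Python 3 as follows. This is an illustration under my own assumptions, not the OP's code: the variable names are hypothetical, and the explicit ASCII decode stands in for the implicit one that Python 2 performs when encode() is called on a byte string.

```python
raw = b'ad\xc2-ven\xc2-ture'  # bytes, as received

# Python 2 decoded the bytes as ASCII behind the scenes before encoding;
# doing that step explicitly reproduces the error from the question.
try:
    raw.decode('ascii')
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = str(exc)

# Correct direction: decode bytes to text first, then encode text to bytes.
text = raw.decode('utf-8', 'ignore')
clean = text.encode('utf-8')

print(implicit_error)
print(text, clean)
```

On the Python 3 versions I am aware of, only the stray 0xc2 bytes are dropped and the hyphens survive, so text comes out as 'ad-ven-ture' rather than the 'adventure' shown in the 2.6 session above.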

2011-05-25 13:09:54
