What is the difference between these ways of handling Unicode strings in Python?

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8") and print uni_str.encode("utf-8"), but only the first one works.

>>> print '\xe8\xb7\xb3'.decode("utf-8") 
跳 
>>> print u'\xe8\xb7\xb3\xe8' 
è·³è 
>>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8") 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128) 
>>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8") 
è·³è

I'm really confused about how to display a Unicode string properly. If I have a string like a=u'\xe8\xb7\xb3\xe8', how do I print a?


2012-08-05 Hanfei Sun

u'\xe8\xb7\xb3\xe8' is è·³è; why do you expect it to print anything else? The form for 跳 is u'\u8df3'. – prosfilaes 2012-08-05 07:18:15

Your first one is correct. What's wrong with that one? – BrenBarn 2012-08-05 07:18:44

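If you have a string like that, then it is already broken. You need to encode it as Latin-1, which turns it into a byte string with the same byte values, and then decode that as UTF-8.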
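
The repair described here can be sketched as follows. This is only a minimal sketch, assuming the string really holds UTF-8 bytes that were mistaken for code points. The variable names are illustrative, and passing 'replace' as the error handler is just one possible way to cope with the stray trailing \xe8 from the question. It is written to run on both Python 2.7 and Python 3.

```python
# Illustrative names only; nothing here comes from the original post.

# Each code point in this string is really one byte of UTF-8 data.
mojibake = u'\xe8\xb7\xb3'

# Latin-1 maps code points 0-255 to the identical byte values.
raw_bytes = mojibake.encode('latin-1')

# The bytes can now be decoded as the UTF-8 they were all along.
fixed = raw_bytes.decode('utf-8')
print(fixed == u'\u8df3')

# The question's string carries an extra trailing \xe8, which is an
# incomplete UTF-8 sequence. A strict decode would raise, so this uses
# the 'replace' error handler to substitute U+FFFD instead.
broken = u'\xe8\xb7\xb3\xe8'
lenient = broken.encode('latin-1').decode('utf-8', 'replace')
print(lenient.startswith(u'\u8df3'))
```

Whether to replace, ignore, or reject the incomplete tail depends on where the data came from; if the string was truncated upstream, fixing that source is usually the better option.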


2012-08-05 07:09:16

It isn't broken. When I try '\xe8\xb7\xb3'.decode("utf-8"), the result is '跳', which is exactly what I want. But the problem is that it has now become u'\xe8\xb7\xb3' instead of '\xe8\xb7\xb3'. How can I restore it? – 2012-08-05 07:12:43

'\xe8\xb7\xb3\xe8' is not valid UTF-8. The final \xe8 is extraneous. – 2012-08-05 07:26:04

@Kenji: OK. But that isn't my problem. And it may not be the whole of the data either. – 2012-08-05 07:59:34

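'\xe8\xb7\xb3' is a Chinese character encoded in UTF-8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns u'\u8df3', the Unicode value of 跳. But u'\xe8\xb7\xb3' is a literal Unicode string, and it is not the same as the Unicode for 跳. A Unicode string also cannot be decoded, because it is already Unicode. Finally, ~~a=u'\xe8\xb7\xb3\xe8' is really not a valid Unicode string~~ [1].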

Where does u'\xe8\xb7\xb3' come from? Another function?

[1] See the first comment.
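
To make the distinction concrete, here is a small sketch comparing the UTF-8 byte string with the look-alike Unicode literal. The variable names are illustrative, and the code is written to run on both Python 2.7 and Python 3.

```python
# UTF-8 encoded bytes for the single character U+8DF3.
utf8_bytes = b'\xe8\xb7\xb3'
decoded = utf8_bytes.decode('utf-8')

# A Unicode literal with the same escapes is three separate code points:
# U+00E8, U+00B7 and U+00B3.
literal = u'\xe8\xb7\xb3'

print(len(decoded))                       # length of the decoded text
print(len(literal))                       # length of the literal
print([hex(ord(ch)) for ch in literal])   # code points in the literal
print(decoded == literal)                 # whether the two are equal
```

In Python 2, calling .decode on a unicode object first encodes it implicitly with the ASCII codec, which is why the traceback in the question reports a UnicodeEncodeError rather than a decode error.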


2012-08-05 07:46:23 xiaowl

No, it is a valid Unicode string. It just isn't the Unicode string the asker is looking for. – 2012-08-05 08:00:11

The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3', which is encoded in UTF-8 as '\xe8\xb7\xb3'.

In Python 2, unicode is a UCS-2 string on narrow builds (this is a build option). So u'\xe8\xb7\xb3\xe8' is a string of four 16-bit Unicode characters.

If you have a UTF-8 string (an 8-bit string) that has wrongly been presented as Unicode (a 16-bit string), you must first convert it back to an 8-bit string:

>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8') 
u'\u8df3'

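Note that '\xe8\xb7\xb3\xe8' is an invalid UTF-8 string, because the last byte '\xe8' is the lead byte of a three-byte sequence and so cannot be the final byte of a UTF-8 string.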
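
The lead-byte rule behind this note can be checked with a short sketch. The helper utf8_sequence_length is hypothetical (it is not a standard library function) and only covers the lead-byte ranges of well-formed UTF-8. Since 0xE8 falls in the range 0xE0 to 0xEF, it announces a three-byte sequence, and a strict decode rejects the truncated input.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte, or 0.

    Hypothetical helper for illustration; takes the byte as an int.
    """
    if lead_byte < 0x80:
        return 1          # 0xxxxxxx: single-byte (ASCII)
    if 0xC2 <= lead_byte <= 0xDF:
        return 2          # 110xxxxx: two-byte sequence
    if 0xE0 <= lead_byte <= 0xEF:
        return 3          # 1110xxxx: three-byte sequence
    if 0xF0 <= lead_byte <= 0xF4:
        return 4          # 11110xxx: four-byte sequence
    return 0              # continuation byte or invalid lead byte


print(utf8_sequence_length(0xE8))

# The trailing \xe8 starts a sequence that never gets its two
# continuation bytes, so a strict decode raises UnicodeDecodeError.
data = b'\xe8\xb7\xb3\xe8'
try:
    data.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```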


2012-08-05 08:48:57
