Printing UTF-8 characters with Python 2.7

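Here is how I open, read, and print the file. It is a UTF-8 encoded file of Unicode characters. I want to print the first 10 characters, but the snippet below prints 10 weird, unrecognizable characters instead. Does anyone have an idea how to print them correctly? Thanks.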

with open(name, 'r') as content_file:
    content = content_file.read()
    for i in range(10):
        print content[i]

Each of the 10 weird characters looks like this:

�

Regards, Lin

Asked 2016-07-29 by Lin Ma

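Please share the contents of the text file. –

The console or TTY has to support the characters as well; you may need to change your terminal settings. – cdarke

@cdarke, thanks, and upvoted. My console can print the whole content correctly, which should prove that it supports UTF-8 characters. The problem only occurs when I print 'content[i]'. If you have any ideas, that would be great. –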

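When Unicode code points (characters) are encoded as UTF-8, some code points are converted to a single byte, but many become more than one byte. Characters in the standard 7-bit ASCII range are encoded as single bytes, but more exotic characters generally need more bytes to encode.

So you're getting those weird characters because you're breaking those multi-byte UTF-8 sequences into single bytes. Sometimes those bytes will correspond to normal printable characters, but often they won't, so you get � printed instead.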
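As a rough illustration of where the � comes from (a sketch in Python 3 syntax, not part of the original answer): a single byte cut out of a multi-byte sequence is not valid UTF-8 on its own, so a decoder either rejects it or substitutes U+FFFD, the replacement character, which many terminals draw as �. The sample bytes and variable names are chosen only for this demonstration.

```python
# One 3-byte character (the trademark sign, U+2122) as raw UTF-8 bytes.
utfbytes = b"\xe2\x84\xa2"

# Decoding the whole sequence yields exactly one character.
whole = utfbytes.decode("utf-8")
print(len(whole))

# Decoding only the first byte fails under the default strict error handling.
first = utfbytes[0:1]
try:
    first.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# With errors="replace" the decoder substitutes U+FFFD instead of raising.
replaced = first.decode("utf-8", "replace")
print(ascii(replaced))
```

The terminal in the question is presumably doing something similar to the last step when it is handed one stray byte at a time.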

Here's a short demo using the ©, ® and ™ characters, which are encoded as 2, 2 and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2" 
print utfbytes, len(utfbytes) 
for b in utfbytes: 
    print b, repr(b) 

uni = utfbytes.decode('utf-8') 
print uni, len(uni)

Output

© ® ™ 9                                   
� '\xc2'                                  
� '\xa9'                                  
    ' ' 
� '\xc2' 
� '\xae' 
    ' ' 
� '\xe2' 
� '\x84' 
� '\xa2' 
© ® ™ 5
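The byte counts quoted above can be checked directly. The following is a small Python 3 sketch (the helper name `utf8_width` and the sample characters are assumptions made for illustration) that encodes a few characters and reports how many bytes each one needs in UTF-8.

```python
# Python 3: str holds code points, bytes holds the encoded form.
# Samples: 'a', copyright sign, trademark sign, a CJK character, an emoji.
samples = ["a", "\u00a9", "\u2122", "\u4e2d", "\U0001f600"]

def utf8_width(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

widths = [utf8_width(ch) for ch in samples]
for ch, w in zip(samples, widths):
    # Print the code point number rather than the character itself,
    # so the output does not depend on the terminal's encoding.
    print("U+%04X -> %d byte(s)" % (ord(ch), w))
```

ASCII stays at one byte, while the other samples take two, three, three and four bytes respectively.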

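Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You should also take a look at the Unicode HOWTO article in the Python docs, and Ned Batchelder's Pragmatic Unicode article, aka "Unipain".

Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mentioned in the comments, to do this correctly you need to know how many bytes each character is encoded as.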

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2" 
widths = (2, 1, 2, 1, 3) 
start = 0 
for w in widths: 
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w]) 
    start += w

Output

0 2 [©] 
2 1 [ ] 
3 2 [®] 
5 1 [ ] 
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2" 
widths = (2, 1, 2, 1, 3) 
start = 0 
for w in widths: 
    s = utfbytes[start:start+w] 
    print("%d %d [%s]" % (start, w, s.decode())) 
    start += w

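If we don't know the byte widths of the characters in our UTF-8 string then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in its first byte, as described in the Wikipedia article on UTF-8.

The following Python 2 demo shows how to extract the width information; it produces the same output as the previous two snippets.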

# UTF-8 code widths
# width  starting byte
# 1      0xxxxxxx
# 2      110xxxxx
# 3      1110xxxx
# 4      11110xxx
# C      10xxxxxx  (continuation byte)

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        # Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2" 
start = 0 
while start < len(utfbytes): 
    b = utfbytes[start] 
    w = get_width(b) 
    s = utfbytes[start:start+w] 
    print "%d %d [%s]" % (start, w, s) 
    start += w
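An alternative way to find the character boundaries, sketched here in Python 3, is to look at the continuation bytes (10xxxxxx in the table above) instead of the leading bytes: every byte that is not a continuation byte starts a new character. The function name `split_utf8` is hypothetical, and the sketch assumes the input is valid UTF-8 that begins on a character boundary; it does no validation.

```python
def split_utf8(data):
    """Split UTF-8 bytes into one bytes object per encoded character.

    Assumes `data` is valid UTF-8 starting on a character boundary.
    """
    chunks = []
    for i in range(len(data)):
        byte = data[i]                  # an int in Python 3
        if (byte & 0xC0) == 0x80:       # 10xxxxxx: continuation byte
            chunks[-1] += data[i:i + 1]
        else:                           # any other byte starts a character
            chunks.append(data[i:i + 1])
    return chunks


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
chunks = split_utf8(utfbytes)
for c in chunks:
    print(len(c), c)
```

This avoids the width table entirely, at the cost of not detecting malformed input.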

In general, it should not be necessary to do this sort of thing: just use the provided decoding methods.
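Applied to the file from the question, that advice might look like the sketch below. It uses `io.open` with an explicit encoding, which is available in both Python 2.7 and Python 3, so that reading returns decoded text and slicing counts characters rather than bytes. The helper name `first_chars`, the temporary file, and the sample text are all made up for the demonstration.

```python
import io
import os
import tempfile

def first_chars(path, n=10):
    """Return the first n characters (not bytes) of a UTF-8 text file."""
    # io.open decodes while reading, so read(n) returns at most n characters.
    with io.open(path, "r", encoding="utf-8") as content_file:
        return content_file.read(n)

# Demonstration with a throwaway file; the sample text is invented.
sample = u"\u00a9 \u00ae \u2122 and some ASCII"
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    head = first_chars(path, 5)
finally:
    os.remove(path)

print(len(head))  # 5 characters, although they occupy 9 bytes on disk
```

With the file opened this way, a loop such as the one in the question would index whole characters.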

For the curious, here's a Python 3 version of get_width, and a function that decodes a UTF-8 byte string manually.

def get_width(b):
    if b <= 0x7f:
        return 1
    elif 0x80 <= b <= 0xbf:
        # Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xc0 <= b <= 0xdf:
        return 2
    elif 0xe0 <= b <= 0xef:
        return 3
    elif 0xf0 <= b <= 0xf7:
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)

def decode_utf8(utfbytes):
    start = 0
    uni = []
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        if w == 1:
            n = b
        else:
            # Keep only the payload bits of the leading byte
            n = b & (0x7f >> w)
            for b in utfbytes[start+1:start+w]:
                if not 0x80 <= b <= 0xbf:
                    raise ValueError('Not a continuation byte: %r' % b)
                # Each continuation byte contributes 6 payload bits
                n <<= 6
                n |= b & 0x3f
        uni.append(chr(n))
        start += w
    return ''.join(uni)


utfbytes = b'\xc2\xa9 \xc2\xae \xe2\x84\xa2' 
print(utfbytes.decode('utf8')) 
print(decode_utf8(utfbytes))

Output

© ® ™
© ® ™

Answered 2016-07-29 07:34:57

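Thanks PM 2Ring, upvoted your reply. I tried your approach and it works well. One more question: suppose the original string contains both non-ASCII characters (e.g. Chinese/Japanese characters) and English characters in the same UTF-8 encoded string, say the first character is a Chinese character and the second is the ASCII letter 'a' (both UTF-8 encoded). After I call 'utfbytes.decode('utf-8')', when I refer to the second character with 'utfbytes[1]', will it correctly identify 'a'? –

(continued) I have this confusion because you mentioned multi-byte and single-byte characters, and I wonder how they work when they are mixed in the original UTF-8 encoded string. Thanks. –

BTW, I tried it, and 'a' is output correctly when referring to 'utfbytes[1]'; I just want to confirm that my understanding is correct. Thanks. –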
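For the mixed case raised in these comments, a short Python 3 sketch may help. The result of indexing depends on whether you index the encoded bytes or the decoded text: decode() does not change the byte string in place, it returns a new text object. The sample string (U+4E2D followed by 'a') and the variable names are assumptions chosen to match the scenario described.

```python
text = "\u4e2da"                  # hypothetical sample: U+4E2D followed by 'a'
utfbytes = text.encode("utf-8")   # b'\xe4\xb8\xada'

print(len(utfbytes))   # 4: three bytes for U+4E2D, one for 'a'
print(len(text))       # 2

# Index 1 of the encoded bytes lands inside the first character.
second_byte = utfbytes[1:2]

# Index 1 of the decoded text is the second character.
decoded = utfbytes.decode("utf-8")
second_char = decoded[1]

print(second_byte, second_char)
```

So the index only lines up with the character position once you are working with the decoded result rather than the original byte string.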

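To output a Unicode string to a file or the console, you need to choose a text encoding. In Python 2 the default text encoding is ASCII, so to support other characters you need to use a different encoding, such as UTF-8: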

s = unicode(your_object).encode('utf8') 
print s

Answered 2016-07-29 06:41:19

Thanks U.Swap, upvoted. Should I use 'content' in place of 'your_object'? –
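On that last question, one point is worth spelling out. If 'content' holds the raw bytes read from the file, then in Python 2 calling unicode(content) attempts an implicit ASCII decode and fails on non-ASCII bytes, so an explicit decode is the safer first step. The sketch below (valid in both Python 2.7 and Python 3) assumes `content` is such a byte string; the sample bytes are invented.

```python
# `content` stands in for the bytes read from the file in the question.
content = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2 trailing ASCII"

text = content.decode("utf-8")        # bytes -> characters
head = text[:3]                       # slice by character, not by byte
encoded_head = head.encode("utf-8")   # characters -> bytes for output

print(len(head))           # 3 characters
print(len(encoded_head))   # 5 bytes once re-encoded
```

Slicing after decoding keeps every multi-byte sequence intact, and re-encoding at the end produces bytes that a UTF-8 terminal can display.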
