How to correctly tabulate Unicode data

I have this test:

# -*- coding: utf-8 -*-

import binascii

test_cases = [
    'aaaaa',  # Normal bytestring
    'ááááá',  # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded
    'ℕℤℚℝℂ',  # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python
    u'aaaaa',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ááááá',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ℕℤℚℝℂ',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
]
FORMAT = '%-20s -> %2d %-20s %-30s %-30s'
for data in test_cases:
    try:
        hexlified = binascii.hexlify(data)
    except Exception:
        # hexlify() fails for unicode objects that contain non-ASCII characters
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))

It produces this output:

aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'

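As you can see, the columns are not aligned correctly for the strings that contain non-ASCII characters. This is because the length of those strings in bytes is greater than their number of Unicode characters. How can I tell print to count characters, rather than bytes, when padding the fields?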

Source

2013-12-18 dangonfast

Start by working with characters rather than bytes.

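When Python 2.7 sees 'ℕℤℚℝℂ', it reads it as "here are 15 arbitrary byte values". It does not know which characters they represent, nor which encoding they are in. You need to decode this byte string into a unicode string, specifying the encoding, before you can expect Python to count characters: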

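for data in test_cases:
    if isinstance(data, bytes):
        # bytes is an alias for str in Python 2; decode so padding counts characters
        data = data.decode('utf-8')
    # Only four values are passed (no hex column), so use a four-field format
    print '%-20s -> %2d %-20s %-30s' % (data, len(data), type(data), repr(data))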
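To reuse the decode-then-format step, it can be wrapped in small helpers. This is only a sketch: `to_text` and `format_row` are hypothetical names, not part of any library, and it assumes the byte strings are UTF-8 encoded. It is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return text, decoding byte strings with `encoding`."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_row(value, width=20):
    """Pad by character count, then append the character length."""
    text = to_text(value)
    return '%-*s -> %2d' % (width, text, len(text))


if __name__ == '__main__':
    samples = ['ááááá'.encode('utf-8'), 'ℕℤℚℝℂ'.encode('utf-8'), 'aaaaa']
    # Every row has the same length in characters, so the columns line up.
    print([len(format_row(item)) for item in samples])  # [26, 26, 26]
```

Decoding once at the boundary means the rest of the code only ever handles text, so the padding no longer depends on how many bytes each character needs.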

Note that in Python 3, all string literals are unicode objects by default.
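A minimal Python 3 illustration of that point, assuming a UTF-8 source file (the Python 3 default). The variable names are only for the example.

```python
# Python 3: a plain string literal is already text (str), not bytes.
text = 'ℕℤℚℝℂ'
raw = text.encode('utf-8')        # explicit conversion to bytes

char_count = len(text)            # counts code points
byte_count = len(raw)             # counts UTF-8 bytes
padded = '{:<20}|'.format(text)   # pads by code points, not bytes

print(char_count, byte_count, len(padded))  # 5 15 21
```

No decoding step is needed here, because the literal never was a byte string in the first place.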

Source

2013-12-18 10:25:35 Eric
