2013-12-18 32 views
2

(I'm on Python 2.7) How do I correctly tabulate Unicode data?

I have this test:

# -*- coding: utf-8 -*- 

import binascii 

test_cases = [ 
    'aaaaa', # Normal bytestring 
    'ááááá', # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded 
    'ℕℤℚℝℂ', # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python 
    u'aaaaa', # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file 
    u'ááááá', # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file 
    u'ℕℤℚℝℂ', # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file 
] 
FORMAT = '%-20s -> %2d %-20s %-30s %-30s' 
for data in test_cases:
    try:
        hexlified = binascii.hexlify(data)
    except UnicodeError:
        # hexlify() implicitly ASCII-encodes unicode objects, which fails for non-ASCII text
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))

It produces this output:

aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'

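As you can see, the columns are not properly aligned for the byte strings that contain non-ASCII characters. This is because the length of those strings in bytes is greater than the number of Unicode characters they represent. How can I tell print to consider the number of characters, not the number of bytes, when padding the fields?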

+1

Use characters instead of bytes in the first place.

Answers

3

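When Python 2.7 sees 'ℕℤℚℝℂ', it reads it as "here are 15 arbitrary byte values". It does not know which characters they represent, nor which encoding was used to represent them. You need to decode the byte string into a unicode string, specifying the encoding, before you can expect Python to count characters: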

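for data in test_cases:
    if isinstance(data, bytes):
        data = data.decode('utf-8')
    # FORMAT has five fields, so supply a hex column too (here: the UTF-8 bytes)
    hexlified = binascii.hexlify(data.encode('utf-8'))
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))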
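
To see the difference between counting bytes and counting characters in isolation, here is a minimal sketch. The names `raw`, `text` and `padded` are only illustrative, and the literals use escapes so the snippet should run unchanged on Python 2.7 and Python 3:

```python
from __future__ import print_function

# UTF-8 bytes of the five characters U+2115 U+2124 U+211A U+211D U+2102
raw = b'\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'

# Decoding turns the bytes into a text (unicode) string
text = raw.decode('utf-8')

print(len(raw))    # number of bytes
print(len(text))   # number of characters

# Padding a text string with a text format pads by characters:
# 20 columns for the field plus the trailing '|'
padded = u'%-20s|' % text
print(len(padded))
```

The byte string has 15 bytes while the decoded string has 5 characters, so only the decoded form is padded to the width you actually see in the column.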

Note that in Python 3, all string literals are unicode objects by default.
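
As a rough illustration of that Python 3 behaviour (a sketch for Python 3 only; `data` and `row` are just example names), an unprefixed literal is already text, so `len()` and `%`-padding work in characters without an explicit decode:

```python
# Python 3 only: an unprefixed literal is a str (text), not bytes.
# Escapes are used instead of the raw characters to keep the file ASCII-only.
data = '\u2115\u2124\u211a\u211d\u2102'

print(type(data) is str)            # the literal is a text string
print(len(data))                    # counted in characters
print(len(data.encode('utf-8')))    # bytes only appear after encoding

# The field is padded to 20 characters, then ' -> ' and a 2-wide number
row = '%-20s -> %2d' % (data, len(data))
print(len(row))
```

Byte strings still exist in Python 3, but they have to be written with a `b` prefix or produced by `encode()`, so they are harder to mix up with text by accident.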
