2013-05-10 61 views

Ugh, Python 2/3 is so frustrating... Consider this example, test.py. How do I get the same Unicode string length in both Python 2 and Python 3?

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys 
if sys.version_info[0] < 3: 
    text_type = unicode 
    binary_type = str 
    def b(x):
        return x
    def u(x):
        return unicode(x)
else: 
    text_type = str 
    binary_type = bytes 
    import codecs 
    def b(x):
        return codecs.latin_1_encode(x)[0]
    def u(x):
        return x

tstr = " ▲ " 

sys.stderr.write(tstr) 
sys.stderr.write("\n") 
sys.stderr.write(str(len(tstr))) 
sys.stderr.write("\n") 

Running it:

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
▲ 
3 

Great, I get two different string sizes. Hopefully wrapping the string in one of the wrappers I found around the net will help?

For tstr = text_type(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = text_type(" ▲ ") 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128) 
$ python3.2 test.py 
▲ 
3 

For tstr = u(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = u(" ▲ ") 
    File "test.py", line 11, in u 
    return unicode(x) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128) 
$ python3.2 test.py 
▲ 
3 

For tstr = b(" ▲ "):

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = b(" ▲ ") 
    File "test.py", line 17, in b 
    return codecs.latin_1_encode(x)[0] 
UnicodeEncodeError: 'latin-1' codec can't encode character '\u25b2' in position 1: ordinal not in range(256) 

For tstr = binary_type(" ▲ "):

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = binary_type(" ▲ ") 
TypeError: string argument without an encoding 

Well, that certainly makes things simple.

So, how do I get the same string length (3 in this case) in both Python 2.7 and 3.2?

Answer


Well, it turns out that unicode() in Python 2.7 takes an encoding argument, and passing it apparently helps:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys 
if sys.version_info[0] < 3: 
    text_type = unicode 
    binary_type = str 
    def b(x):
        return x  # str is already a byte string on Python 2
    def u(x):
        return unicode(x, "utf-8")  # decode the UTF-8 source bytes into text
else: 
    text_type = str 
    binary_type = bytes 
    import codecs 
    def b(x):
        return codecs.latin_1_encode(x)[0]  # only works for code points below 256
    def u(x):
        return x  # str is already Unicode text on Python 3

tstr = u(" ▲ ") 

sys.stderr.write(tstr) 
sys.stderr.write("\n") 
sys.stderr.write(str(len(tstr))) 
sys.stderr.write("\n") 

Running it, I get what I need:

$ python2.7 test.py 
▲ 
3 
$ python3.2 test.py 
▲ 
3
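
For what it's worth, the 5 versus 3 seems to come down to bytes versus code points: in a UTF-8 source file, a plain Python 2 literal is a byte string, and ▲ (U+25B2) takes three bytes in UTF-8, so the two spaces plus the triangle add up to 5. Below is a minimal sketch of that idea, assuming the input is UTF-8. The helper name to_text is my own invention, not something from the wrappers above, and it relies on bytes being an alias for str on Python 2.6+.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as text: unicode on Python 2, str on Python 3.

    Hypothetical helper for illustration; assumes byte input is UTF-8.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is the same type as str
        return value.decode(encoding)
    return value  # already text, leave it alone

# The UTF-8 bytes of " ▲ ", spelled out so the literal means the same
# thing on both Python 2 and Python 3.
raw = b" \xe2\x96\xb2 "
text = to_text(raw)

print(len(raw))   # 5: counts bytes (space + 3 bytes for U+25B2 + space)
print(len(text))  # 3: counts code points
```

With this, tstr = to_text(" ▲ ") should give a length of 3 on both versions, since the plain literal is bytes on Python 2 (and gets decoded) but already text on Python 3 (and is passed through). Another option on Python 2.6+ is putting from __future__ import unicode_literals at the top of the file, which makes plain literals text there too.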