Python regex tokenization of Unicode strings does not work as expected

I've run into a strange problem with regex tokenization and Unicode strings.

> import re
> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

> print tokens 
['Unicode', 'r\xc3', 'gular', 'expressions']

This is what I expected:

> print tokens 
['Unicode', 'rägular', 'expressions']

What do I have to do to get the expected result?

'\w' does not include Unicode characters like ä. – Xufox

So what is the way to do this? – boadescriptor

\w does include Unicode characters if you use re.UNICODE. – Javier

The string must be a unicode string.

mystring = u"Unicode rägular expressions" 
tokens = re.findall(r'\w+', mystring, re.UNICODE)
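
For comparison, here is a minimal sketch of the same tokenization under the assumption that the code runs on Python 3. That goes beyond what this answer covers (it targets Python 2, where the u prefix is what makes the literal a unicode object). In Python 3 a plain string literal is already Unicode text, so no prefix is needed.

```python
import re

# Assumption: Python 3, where a plain string literal is already Unicode text.
mystring = "Unicode rägular expressions"

# For str patterns in Python 3, Unicode matching is the default;
# passing re.UNICODE is redundant but harmless.
tokens = re.findall(r"\w+", mystring, re.UNICODE)

print(tokens)  # ['Unicode', 'rägular', 'expressions']
```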

2015-04-18 17:41:18 Javier

That was it. Unicode in Python is a huge headache... – boadescriptor

@boadescriptor: Watch http://nedbatchelder.com/text/unipain.html to reduce the headache. Stick to the Unicode sandwich and avoid handling encoded bytes as much as possible. –

You have Latin-1 or Windows Codepage 1252 bytes, not Unicode text. Decode your input:

tokens = re.findall(r'\w+', mystring.decode('cp1252'), re.UNICODE)

Encoded bytes can mean anything depending on the codec used; a byte is not a specific Unicode code point. For byte strings (type str), \w can only match ASCII characters.
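
The decode-first approach can be sketched roughly as follows, assuming Python 3 (where byte strings have type bytes rather than str). The helper name tokenize and the sample data are hypothetical and only serve the illustration. The sketch contrasts matching on raw bytes with matching on decoded text.

```python
import re


def tokenize(raw, encoding):
    """Hypothetical helper: decode bytes at the boundary, then match on text."""
    text = raw.decode(encoding)
    return re.findall(r"\w+", text)


# Assumption for the demo: the input arrives as cp1252-encoded bytes.
raw = "Unicode rägular expressions".encode("cp1252")

# A bytes pattern only treats ASCII letters, digits and "_" as word characters,
# so the byte for "ä" splits the word.
byte_tokens = re.findall(rb"\w+", raw)

# Decoding first lets \w see "ä" as a letter.
text_tokens = tokenize(raw, "cp1252")

print(byte_tokens)  # [b'Unicode', b'r', b'gular', b'expressions']
print(text_tokens)  # ['Unicode', 'rägular', 'expressions']
```

The codec passed to decode has to match how the bytes were actually produced. cp1252 here follows this answer's assumption. Note that \xc3 is the first byte of ä in UTF-8, so data that shows 'r\xc3' may need 'utf-8' instead.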

2015-04-18 17:44:03
