Python將\ uxxxx作爲字符串文字內的Unicode字符轉義(例如,u「\ u2014」被解釋爲Unicode字符U + 2014)。但我剛剛發現(Python 2.7)標準正則表達式模塊不會將\ uxxxx當作unicode字符。例如:python re(regex)是否有一個替代 u的unicode轉義序列?
codepoint = 2014 # Say I got this dynamically from somewhere
test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert(pattern[-5:] == "2014$") # Ends with an escape sequence for U+2014
assert(re.search(pattern, test) != None) # Failure -- No match (bad)
assert(re.search(pattern, "u2014")!= None) # Success -- This matches (bad)
顯然,如果你可以指定你的正則表達式作爲一個字符串,那麼你可以有同樣的效果,如果正則表達式引擎本身理解爲\ uXXXX轉義:
test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert(pattern[:-1] == u"\u2014") # Ends with actual unicode char U+2014
assert(re.search(pattern, test) != None)
但如果你需要動態構建你的模式呢?
您正在創建一個字符串''\ u%s「,然後插入代碼點,並且*不*首先被解釋爲'\ u ....'。這是*預期的行爲*。改用'u'%s'%unichr(codepoint)'。 – 2013-05-14 11:16:20