Python emojis and term offsets in the Elasticsearch analyzer

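I am using Elasticsearch with the Python client, and I have a question about how unicode, ES, analyzers, and emojis interact. When I run a unicode text string containing an emoji character through an ES analyzer, it seems to throw off the term offsets in the resulting output.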

For example:

>> es.indices.analyze(body=u'\U0001f64f testing') 
{u'tokens': [{u'end_offset': 10, 
    u'position': 1, 
    u'start_offset': 3, 
    u'token': u'testing', 
    u'type': u'<ALPHANUM>'}]}

This gives me incorrect offsets for the term testing.

>> u'\U0001f64f testing'[3:10] 
u'esting'

If I do the same thing with another non-ASCII Unicode character (for example, the yen symbol), I don't get the same error.

>> es.indices.analyze(body=u'\u00A5 testing') 
{u'tokens': [{u'end_offset': 9, 
    u'position': 1, 
    u'start_offset': 2, 
    u'token': u'testing', 
    u'type': u'<ALPHANUM>'}]} 

>> u'\u00A5 testing'[2:9] 
u'testing'

Can anyone explain what is going on?


2015-09-19 plam

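Are you on Python 3.2 or lower? Before Python 3.3, CPython came in narrow and wide Unicode builds, and the Windows builds were narrow. A narrow build uses two bytes per code unit and stores Unicode text internally as UTF-16, so any code point above U+FFFF is encoded as a pair of UTF-16 surrogates.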

Python 3.3.5 (v3.3.5:62cf4e77f785, Mar 9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32 
Type "help", "copyright", "credits" or "license" for more information. 
>>> len('\U0001f64f') 
1 
>>> '\U0001f64f'[0] 
'\U0001f64f' 

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32 
Type "help", "copyright", "credits" or "license" for more information. 
>>> len(u'\U0001f64f') 
2 
>>> u'\U0001f64f'[0] 
u'\ud83d' 
>>> u'\U0001f64f'[1] 
u'\ude4f'
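
To check which kind of build an interpreter is, and how many UTF-16 code units a string occupies regardless of the build, something like the following may help. This is a minimal sketch; `utf16_length` and `is_narrow_build` are hypothetical names introduced here for illustration, not part of any library.

```python
import sys

def utf16_length(text):
    """Number of UTF-16 code units needed to encode text (hypothetical helper)."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" does not prepend a BOM.
    return len(text.encode("utf-16-le")) // 2

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide build
# (and on every build from Python 3.3 onward).
is_narrow_build = sys.maxunicode == 0xFFFF

emoji = u"\U0001f64f"
print(is_narrow_build, len(emoji), utf16_length(emoji))
```

On a narrow build `len(emoji)` and `utf16_length(emoji)` agree; elsewhere `len` counts code points, so the two differ for characters above U+FFFF.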

In your case, however, the offsets are correct. Because U+1F64F takes two UTF-16 surrogates, the "t" is at offset 3. I don't know how you got your output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32 
Type "help", "copyright", "credits" or "license" for more information. 
>>> x=u'\U0001f64f testing' 
>>> x 
u'\U0001f64f testing' 
>>> x[3:10] 
u'testing' 
>>> y = u'\u00a5 testing' 
>>> y[2:9] 
u'testing'
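
If the interpreter indexes strings by code point (a wide build, or Python 3.3+), offsets counted in UTF-16 code units will not line up with string indices once a character above U+FFFF appears earlier in the text. A minimal sketch of one way to slice by such offsets, assuming the analyzer reports offsets in UTF-16 code units as the output in the question suggests; `slice_by_utf16_offsets` is a hypothetical helper name:

```python
def slice_by_utf16_offsets(text, start, end):
    """Return the part of text between two UTF-16 code-unit offsets.

    Assumption: start and end are counted in UTF-16 code units,
    not in Python string indices.
    """
    # Encode without a BOM so that code unit i starts at byte 2 * i.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start:2 * end].decode("utf-16-le")

text = u"\U0001f64f testing"
print(slice_by_utf16_offsets(text, 3, 10))  # testing
```

The round trip through bytes gives the same result on narrow and wide builds. It raises `UnicodeDecodeError` if an offset falls in the middle of a surrogate pair, which should not happen for token boundaries like the ones above.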


2015-09-19 08:55:40

In Python 2 there are both narrow builds (your case, Windows) and wide CPython builds, e.g. on Ubuntu, where u'\U0001f64f'[0] == u'\U0001f64f'. – jfs
