Setting the codec / searching for unicode values in Elasticsearch from Python

This problem is probably down to my inexperience with ELK, Python, and Unicode.

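I have an index containing logstash-digested logs, including a field 'host_req', which contains a host name. Using Elasticsearch-py, I pull that host name out of the record and then use it to search in another index. However, if the host name contains multibyte characters, the search fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with 'curl -XGET'. The unicode character is a lowercase 'a' with a diaeresis (two dots). Its UTF-8 value is C3 A4, and the unicode code point seems to be 00E4 (the language is Swedish).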
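For anyone who wants to check the code point and byte values quoted above, here is a minimal sketch using only the Python standard library (no Elasticsearch involved). The variable names are made up for illustration. It also shows that writing the UTF-8 bytes into a \u escape, as in \uC3A4, names a different code point rather than the same character.

```python
# -*- coding: utf-8 -*-
import unicodedata

a_umlaut = u"\u00e4"                     # one code point, U+00E4
utf8_bytes = a_umlaut.encode("utf-8")    # encoded as two bytes: C3 A4

print(unicodedata.name(a_umlaut))        # LATIN SMALL LETTER A WITH DIAERESIS
print(len(a_umlaut), len(utf8_bytes))    # one character, two bytes

# "\uC3A4" is a single, unrelated code point (U+C3A4),
# not the UTF-8 byte pair C3 A4.
not_a_umlaut = u"\uc3a4"
print(unicodedata.name(not_a_umlaut))
print(not_a_umlaut == a_umlaut)
```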

These curl commands work just fine from the command line:

curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}' 
curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'

They find and return the record.

(The second line shows the host name as I pulled it from the log, with the lowercase 'a' with diaeresis in two places.)

I've written a very short Python script to show the problem: it uses hardwired queries, prints them and their type, then tries to use them in a search.

#!/usr/bin/python 
# -*- coding: utf-8 -*- 

import json 
import elasticsearch 

es = elasticsearch.Elasticsearch() 

if __name__=="__main__": 
    #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}'   # raw utf-8 characters. does not work 
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}' # quoted unicode characters. does not work 
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}' # quoted utf-8 characters. does not work 
    uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}'      # non-unicode. works fine 
    print "uq", type(uq), uq 
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq); 
    if result["hits"]["total"] == 0: 
        print "nothing found" 
    else: 
        print "found some"

If I run it as shown, with the facebook query, it works fine. Output:

$python testutf8b.py 
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}} 
found some

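Note that the query string 'uq' is unicode.

But if I use any of the other three strings, which include the Unicode characters, it blows up. For example, with the second line, I get: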

$python testutf8b.py 
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}} 
Traceback (most recent call last): 
    File "testutf8b.py", line 15, in <module> 
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq); 
    File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped 
    File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search 
    File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request 
    File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request 
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) 
$

Note again that the query string is a unicode string (and yes, the source line is the one with the \u00E4 escapes).
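The byte offset in the traceback is a useful clue. The sketch below, which needs no Elasticsearch, encodes the same query as UTF-8 and then decodes it as ASCII. It fails on byte 0xc3 at position 45, the same message as above. That is consistent with the already-encoded body being decoded as ASCII somewhere on the way out, but the sketch does not establish where in the client stack that happens.

```python
# -*- coding: utf-8 -*-
# Same query text as the failing line in the script above.
query = u'{ "query": { "match": { "req_host": "www.utkl\u00e4dningskl\u00e4derna.se" }}}'

encoded = query.encode("utf-8")   # a byte string; each 'ä' becomes C3 A4

error = None
try:
    # Treating UTF-8 bytes as ASCII is what an implicit Python 2
    # str-to-unicode conversion would do.
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

print(error)
print(query[45] == u"\u00e4")     # offset 45 is the first 'ä'
```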

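I'd really like to get this resolved. I've tried various combinations of uq = uq.encode("utf-8") and uq = uq.decode("utf-8"), but it doesn't seem to help. I'm starting to wonder whether there's a problem in the elasticsearch-py library.

Thanks!

PS: This is under CentOS 7, using ES 1.5.0. The logs were digested into a slightly older version of ES, using logstash 1.4.2.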


2015-04-03 user3587642

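I've determined that I *can* make this query work from Python by calling .encode("utf-8") on the query string and sending it over a raw socket with the appropriate HTTP headers. The same doesn't seem to work with elasticsearch-py. – user3587642 2015-04-03 20:30:39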
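A rough sketch of what the comment above describes, for readers curious about the raw-socket route. It only builds the request bytes and does not open a socket. The function name, the Content-Type header and the Connection header are assumptions made for illustration; the comment does not say which headers were sent. The point it illustrates is that Content-Length has to count the encoded bytes, not the characters.

```python
# -*- coding: utf-8 -*-

def build_raw_search_request(host, port, path, query):
    """Hypothetical helper: return a full HTTP request as bytes."""
    body = query.encode("utf-8")          # encode explicitly, exactly once
    head = (
        "GET {path} HTTP/1.1\r\n"
        "Host: {host}:{port}\r\n"
        "Content-Type: application/json; charset=UTF-8\r\n"  # assumed header
        "Content-Length: {length}\r\n"    # length in bytes, not characters
        "Connection: close\r\n"           # assumed header
        "\r\n"
    ).format(path=path, host=host, port=port, length=len(body))
    return head.encode("ascii") + body

query = u'{ "query": { "match": { "req_host": "www.utkl\u00e4dningskl\u00e4derna.se" }}}'
request = build_raw_search_request(
    "localhost", 9200, "/logstash-2015.01.30/logs/_search", query)

# Two 'ä' characters, so the body is two bytes longer than the text.
print(len(query), len(query.encode("utf-8")))
```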

Basically, you don't need to pass body as a string. Use native Python data structures, or convert them on the fly. Give this a try, please:

>>> import elasticsearch 
>>> es = elasticsearch.Elasticsearch() 
>>> es.index(index='unicode-index', body={'host': u'www.utklädningskläderna.se'}, doc_type='log') 

{u'_id': u'AUyGJuFMy0qdfghJ6KwJ', 
u'_index': u'unicode-index', 
u'_type': u'log', 
u'_version': 1, 
u'created': True} 

>>> es.search(index='unicode-index', body={}, doc_type='log') 

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, 
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ', 
    u'_index': u'unicode-index', 
    u'_score': 1.0, 
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'}, 
    u'_type': u'log'}], 
    u'max_score': 1.0, 
    u'total': 1}, 
u'timed_out': False, 
u'took': 5} 

>>> es.search(index='unicode-index', body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}, doc_type='log') 

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, 
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ', 
    u'_index': u'unicode-index', 
    u'_score': 0.30685282, 
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'}, 
    u'_type': u'log'}], 
    u'max_score': 0.30685282, 
    u'total': 1}, 
u'timed_out': False, 
u'took': 122} 

>>> import json 

>>> body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}} 

>>> es.search(index='unicode-index', body=body, doc_type='log') 

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, 
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ', 
    u'_index': u'unicode-index', 
    u'_score': 0.30685282, 
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'}, 
    u'_type': u'log'}], 
    u'max_score': 0.30685282, 
    u'total': 1}, 
u'timed_out': False, 
u'took': 4} 

>>> es.search(index='unicode-index', body=json.dumps(body), doc_type='log') 

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, 
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ', 
    u'_index': u'unicode-index', 
    u'_score': 0.30685282, 
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'}, 
    u'_type': u'log'}], 
    u'max_score': 0.30685282, 
    u'total': 1}, 
u'timed_out': False, 
u'took': 5} 

>>> json.dumps(body) 
'{"query": {"match": {"host": "www.utkl\\u00e4dningskl\\u00e4derna.se"}}}'
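Applied to the workflow in the question (take a host name from one record, search for it in another index), the same idea might look like the sketch below. The helper name build_host_query and the stand-in record are hypothetical. The search call is left as a comment so the snippet runs without a cluster. Because json.dumps escapes non-ASCII characters by default, the serialized body is pure ASCII.

```python
# -*- coding: utf-8 -*-
import json

def build_host_query(host, field="req_host"):
    """Hypothetical helper: return a match query as a plain dict."""
    return {"query": {"match": {field: host}}}

# Stand-in for a record fetched from the logstash index.
record = {"_source": {"req_host": u"www.utkl\u00e4dningskl\u00e4derna.se"}}

body = build_host_query(record["_source"]["req_host"])
serialized = json.dumps(body)    # ensure_ascii=True is the default
print(serialized)

# With a live cluster, the dict would be passed directly, as in the answer:
# es.search(index="logstash-2015.01.30", doc_type="logs", body=body)
```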


2015-04-05 18:19:17 Slam

Thanks! This works. – user3587642 2015-04-06 14:08:28
