2015-08-18 75 views

Get specified array elements from an Elasticsearch query

I have an index in Elasticsearch whose records contain an array. Say the field name is "sample" and the array is:

["ABC", "XYZ", "MNP" .....]

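Is there a query that lets me specify how many elements of the array are retrieved? Suppose I want the retrieved records to contain only the first 2 elements of the sample array.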

Answer


Assuming your document holds an array of strings, I have a couple of ideas that may help you.

PUT /arrayindex/
{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "spacelyzer": {
                        "tokenizer": "whitespace"
                    },
                    "commalyzer": {
                        "type": "custom",
                        "tokenizer": "commatokenizer",
                        "char_filter": "square_bracket"
                    }
                },
                "tokenizer": {
                    "commatokenizer": {
                        "type": "pattern",
                        "pattern": ","
                    }
                },
                "char_filter": {
                    "square_bracket": {
                        "type": "mapping",
                        "mappings": [
                            "[=>",
                            "]=>"
                        ]
                    }
                }
            }
        }
    },
    "mappings": {
        "array_set": {
            "properties": {
                "array_space": {
                    "analyzer": "spacelyzer",
                    "type": "string"
                },
                "array_comma": {
                    "analyzer": "commalyzer",
                    "type": "string"
                }
            }
        }
    }
}

POST /arrayindex/array_set/1 
{ 
    "array_space": "qwer qweee trrww ooenriwu njj" 
} 

POST /arrayindex/array_set/2 
{ 
    "array_comma": "[qwer,qweee,trrww,ooenriwu,njj]" 
} 

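The DSL above accepts two kinds of array. One is a whitespace-separated string in which every word represents one element of the array; the other is an array in the form you specified. Such an array is possible in Python, and if you index a document like that from Python it will be converted to a string automatically, i.e. ["abc","xyz","mnp".....] will be converted to "["abc","xyz","mnp".....]".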

spacelyzer tokenizes on whitespace; commalyzer tokenizes on commas and strips [ and ] from the string.
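To make the two analyzers concrete, here is a minimal pure-Python sketch that imitates what they do to the two sample strings indexed above. It does not call Elasticsearch; the helper names `spacelyzer_tokens` and `commalyzer_tokens` are hypothetical and only approximate the analyzers' behaviour.

```python
def spacelyzer_tokens(text):
    """Approximate the whitespace tokenizer: split on runs of whitespace."""
    return list(enumerate(text.split()))


def commalyzer_tokens(text):
    """Approximate commalyzer: remove '[' and ']', then split on commas."""
    stripped = text.replace("[", "").replace("]", "")
    # Assumption of this sketch: empty pieces are dropped for simplicity.
    tokens = [piece for piece in stripped.split(",") if piece]
    return list(enumerate(tokens))


# Each result is a list of (position, token) pairs.
space_tokens = spacelyzer_tokens("qwer qweee trrww ooenriwu njj")
comma_tokens = commalyzer_tokens("[qwer,qweee,trrww,ooenriwu,njj]")

print(space_tokens)
print(comma_tokens)
```

Both calls yield the same (position, token) pairs, which is why either field can be used to recover an element's index.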

Now, if you call the Termvector API like this:

GET arrayindex/array_set/1/_termvector 
{ 
    "fields" : ["array_space", "array_comma"], 
    "term_statistics" : true, 
    "field_statistics" : true 
} 

GET arrayindex/array_set/2/_termvector 
{ 
    "fields" : ["array_space", "array_comma"], 
    "term_statistics" : true, 
    "field_statistics" : true 
} 

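You can then read the position of an element straight from the responses. For example, to find the position of "njj", use

  • termvector_response["term_vectors"]["array_comma"]["terms"]["njj"]["tokens"][0]["position"] (on the response for document 2), or

  • termvector_response["term_vectors"]["array_space"]["terms"]["njj"]["tokens"][0]["position"] (on the response for document 1).

Both give 4, which is the element's actual index in the array. I suggest going with the whitespace-based design.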
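As a rough illustration of that lookup, the sketch below runs it against a hand-written dictionary shaped like the `term_vectors` part of a response. The dictionary is an assumption made for the example: it is trimmed to the keys the lookup touches, and a real response carries more fields. The helper `position_of` is hypothetical.

```python
# Hand-written, trimmed stand-in for the term vector response of document 1.
# Assumption: only the keys used by the lookup are included.
termvector_response = {
    "term_vectors": {
        "array_space": {
            "terms": {
                "qwer": {"tokens": [{"position": 0}]},
                "qweee": {"tokens": [{"position": 1}]},
                "trrww": {"tokens": [{"position": 2}]},
                "ooenriwu": {"tokens": [{"position": 3}]},
                "njj": {"tokens": [{"position": 4}]},
            }
        }
    }
}


def position_of(response, field, term):
    """Return the lowest position of `term` in `field`, or None if absent."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    tokens = terms.get(term, {}).get("tokens", [])
    if not tokens:
        return None
    return min(token["position"] for token in tokens)


print(position_of(termvector_response, "array_space", "njj"))
```

Note that an element occurring more than once in the array has several entries under "tokens", so indexing with [0] only reports one of its positions.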

The Python code could be:

from elasticsearch import Elasticsearch

ES_HOST = {"host": "localhost", "port": 9200}
ES_CLIENT = Elasticsearch(hosts=[ES_HOST], timeout=180)

def getTermVector(doc_id):
    # Fetch the term vectors (terms with their token positions) for one document.
    return ES_CLIENT.termvector(
        index="arrayindex",
        doc_type="array_set",
        id=doc_id,
        field_statistics=True,
        fields=['array_space', 'array_comma'],
        term_statistics=True)

def getElements(num, array_no):
    # Print the first `num` elements of the array stored in document `array_no`.
    all_terms = getTermVector(array_no)['term_vectors']['array_space']['terms']
    for i in range(num):
        for term in all_terms:
            for token in all_terms[term]['tokens']:
                if token['position'] == i:
                    # Formatted so the output is the same on Python 2 and 3.
                    print("%s @ index %d" % (term, i))


getElements(3, 1)

# qwer @ index 0
# qweee @ index 1
# trrww @ index 2
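If you would rather get the elements back as a list than print them, one possible variation is to collect every (position, term) pair once, sort by position, and slice. This is only a sketch: `first_elements` is a hypothetical helper, and `sample_terms` is a hand-written stand-in for the "terms" dictionary that getElements reads from the response.

```python
def first_elements(terms, num):
    """Return the first `num` array elements, ordered by token position."""
    positioned = [
        (token["position"], term)
        for term, info in terms.items()
        for token in info.get("tokens", [])
    ]
    # Sorting the pairs orders them by position; then keep the first `num`.
    return [term for _, term in sorted(positioned)[:num]]


# Assumption: trimmed stand-in for response['term_vectors']['array_space']['terms'].
sample_terms = {
    "njj": {"tokens": [{"position": 4}]},
    "ooenriwu": {"tokens": [{"position": 3}]},
    "qweee": {"tokens": [{"position": 1}]},
    "qwer": {"tokens": [{"position": 0}]},
    "trrww": {"tokens": [{"position": 2}]},
}

print(first_elements(sample_terms, 3))
```

This walks the term dictionary once instead of once per requested index, and it keeps repeated elements because every token contributes its own pair.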