Filter by, group by unique field values, sum aggregation and order by, chained in one Elasticsearch query

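I want to issue a single Elasticsearch query that filters, groups by a field, aggregates with a sum and orders the result. I have two questions: what should the query look like, and what are the performance implications for Elasticsearch?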

Let me give a sample data set to support my question. Say I have a set of sales:

document type: 'sales' with the following fields and data: 
sale_datetime | sold_product | sold_at_price 
-----------------|---------------|-------------- 
2015-11-24 12:00 | some product | 100 
2015-11-24 12:30 | some product | 100 
2015-11-24 12:30 | other product | 100 
2015-11-24 13:00 | other product | 100 
2015-11-24 12:30 | some product | 200 
2015-11-24 13:00 | some product | 200

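I want to issue a query which:

takes into account only sales in the time interval from 2015-11-24 12:15 to 2015-11-24 12:45
groups the results by the sold_product field
calculates the "sum of sold_at_price values over each product"
returns rows ordered so that the product with the greatest "sum of sold_at_price values over each product" comes first, then the second greatest, and so on.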

Applied to the sample data set above, it would return the following results:

sold_product | sum of sold_at_price 
--------------|-------------- 
some product | 300  // takes into account rows 2 and 5 
other product | 100  // takes into account row 3
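
As a reference point, the expected result above can be reproduced with a few lines of plain Python. This is only a sketch of the filter, group, sum and order steps on the sample rows, not an Elasticsearch query; the helper names `parse` and `sum_by_product` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (sale_datetime, sold_product, sold_at_price)
SALES = [
    ("2015-11-24 12:00", "some product", 100),
    ("2015-11-24 12:30", "some product", 100),
    ("2015-11-24 12:30", "other product", 100),
    ("2015-11-24 13:00", "other product", 100),
    ("2015-11-24 12:30", "some product", 200),
    ("2015-11-24 13:00", "some product", 200),
]


def parse(ts):
    """Parse the timestamp format used in the sample table."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def sum_by_product(rows, start, end):
    """Keep rows inside [start, end], sum prices per product, sort by sum descending."""
    lo, hi = parse(start), parse(end)
    totals = defaultdict(int)
    for ts, product, price in rows:
        if lo <= parse(ts) <= hi:
            totals[product] += price
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


result = sum_by_product(SALES, "2015-11-24 12:15", "2015-11-24 12:45")
print(result)  # [('some product', 300), ('other product', 100)]
```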

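If it is possible to issue such a query, what are the performance implications for Elasticsearch? In case it matters, consider that:

there are a lot of unique products (hundreds of thousands, potentially millions in the future)
product names can contain several (dozens of) words/terms (it is possible to generate a unique product name consisting of a single word, but that would almost double the data volume)
usually a lot of records (millions) satisfy the time range filter (in some cases the filter can be narrowed down to tens of thousands of records within a given time range, but that is not guaranteed)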

Thanks in advance for your help!


2015-11-24 Andrew

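This is a typical use case for aggregations. We first create an index and model the mapping for the data. We have a normal date field for sold_datetime, a numeric field for sold_at_price and a multi-field of type string for sold_product. You'll notice that this multi-field has a sub-field called raw which is not_analyzed; it will be used to create the aggregation on product names. An analyzed string field is split into individual terms, so aggregating on it would bucket by single words rather than by the full product name: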

curl -XPUT localhost:9200/sales -d '{
  "mappings": {
    "sale": {
      "properties": {
        "sold_datetime": {
          "type": "date"
        },
        "sold_product": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "sold_at_price": {
          "type": "double"
        }
      }
    }
  }
}'

Now, let's index the sample data set into the new index using the _bulk endpoint:

curl -XPOST localhost:9200/sales/sale/_bulk -d ' 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:00:00.000Z", "sold_product":"some product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"other product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"other product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 200} 
{"index": {}} 
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"some product", "sold_at_price": 200} 
'
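
If the documents come from application code rather than a hand-written file, the bulk body can be generated. The following is a minimal sketch that only builds the newline-delimited payload shown above and does not send it; `build_bulk_body` is a hypothetical helper, not part of any client library.

```python
import json


def build_bulk_body(docs):
    """Build a _bulk payload: one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line, lets Elasticsearch assign the id
        lines.append(json.dumps(doc))            # document source
    # The bulk endpoint expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"sold_datetime": "2015-11-24T12:00:00.000Z", "sold_product": "some product", "sold_at_price": 100},
    {"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product": "other product", "sold_at_price": 100},
]
body = build_bulk_body(docs)
print(body)
```

The resulting string can be passed as the request body to the same sales/sale/_bulk endpoint used in the curl command.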

Finally, let's create the query and aggregation you need:

curl -XPOST localhost:9200/sales/sale/_search -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "sold_datetime": {
            "gt": "2015-11-24T12:15:00.000Z",
            "lt": "2015-11-24T12:45:00.000Z"
          }
        }
      }
    }
  },
  "aggs": {
    "sold_products": {
      "terms": {
        "field": "sold_product.raw",
        "order": {
          "total": "desc"
        }
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "sold_at_price"
          }
        }
      }
    }
  }
}'

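As you can see, we are filtering on the sold_datetime field for a specific date interval (November 24, 12:15 to 12:45). The aggregation part defines a terms aggregation on the sold_product.raw field, and for each bucket we sum the values of the sold_at_price field. The "order" setting sorts the buckets by that sum in descending order.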
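
To reuse the same request for other time windows, the body can be assembled in code. This is a sketch, assuming the index and field names used in this answer; `build_sales_query` and its `top_n` parameter are hypothetical names. A terms aggregation returns only the top buckets by default (ten), so with many products you would probably want to set its "size" explicitly.

```python
import json


def build_sales_query(start, end, top_n=None):
    """Build the search body: range filter, terms buckets per product, sum per bucket."""
    terms = {
        "field": "sold_product.raw",   # not_analyzed sub-field, one bucket per full name
        "order": {"total": "desc"},    # sort buckets by the "total" sub-aggregation
    }
    if top_n is not None:
        terms["size"] = top_n          # number of buckets to return
    return {
        "size": 0,                     # no hits needed, only aggregations
        "query": {
            "filtered": {
                "filter": {
                    "range": {"sold_datetime": {"gt": start, "lt": end}}
                }
            }
        },
        "aggs": {
            "sold_products": {
                "terms": terms,
                "aggs": {"total": {"sum": {"field": "sold_at_price"}}},
            }
        },
    }


query = build_sales_query("2015-11-24T12:15:00.000Z", "2015-11-24T12:45:00.000Z", top_n=50)
print(json.dumps(query, indent=2))
```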

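Note that if you have millions of potentially matching documents, you need to apply the most aggressive filter first for this to perform well, perhaps the identifier of the business you're running the query for, or some other criterion that excludes as many documents as possible before the aggregation runs.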
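
One way to express such an additional restriction is to combine it with the date range in a bool filter. The sketch below assumes a `business_id` field on each sale, which is purely hypothetical and not part of the mapping above; `build_filtered_query` is likewise an illustrative name.

```python
import json


def build_filtered_query(start, end, business_id):
    """Combine a selective term filter with the date range inside a bool filter."""
    return {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        # Hypothetical field: assumes each sale stores the owning business.
                        {"term": {"business_id": business_id}},
                        {"range": {"sold_datetime": {"gt": start, "lt": end}}},
                    ]
                }
            }
        }
    }


query_part = build_filtered_query(
    "2015-11-24T12:15:00.000Z", "2015-11-24T12:45:00.000Z", business_id=42
)
print(json.dumps(query_part, indent=2))
```

The returned dictionary would replace the "query" part of the search request, with the "aggs" part left unchanged, so the aggregation only sees documents that pass both filters.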

The result looks like this:

{
  ...
  "aggregations" : {
    "sold_products" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "some product",
        "doc_count" : 2,
        "total" : {
          "value" : 300.0
        }
      }, {
        "key" : "other product",
        "doc_count" : 1,
        "total" : {
          "value" : 100.0
        }
      } ]
    }
  }
}
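
To turn that response into the two-column table from the question, the buckets can be flattened on the client side. A minimal sketch, assuming the response has already been decoded from JSON into a dictionary; `buckets_to_rows` is a hypothetical helper.

```python
# The aggregation part of the response shown above, as a decoded dictionary.
response = {
    "aggregations": {
        "sold_products": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "some product", "doc_count": 2, "total": {"value": 300.0}},
                {"key": "other product", "doc_count": 1, "total": {"value": 100.0}},
            ],
        }
    }
}


def buckets_to_rows(response, agg_name="sold_products", metric="total"):
    """Return (sold_product, sum of sold_at_price) pairs in the order Elasticsearch sent them."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket[metric]["value"]) for bucket in buckets]


rows = buckets_to_rows(response)
for product, total in rows:
    print(product, "|", total)
```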


2015-11-25 05:08:46 Val

Thanks! That is what I needed. I will think about how to apply more filters to reduce the total number of records processed. – Andrew

Glad to help! – Val
