2016-09-22

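Elasticsearch aggregation counts do not match

I am getting incorrect aggregation counts from an ES query. I understand from the ES documentation that the cardinality and terms aggregations are approximate, but the discrepancy I am seeing is far too large.

The mapping of my index is: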

{ 
     "dynamic_templates": [{ 
      "template_action": { 
       "mapping": { 
        "type": "string", 
        "index": "not_analyzed" 
       }, 
       "match": "*", 
       "match_mapping_type": "*" 
      } 
     }], 
     "_parent": { 
      "type": "users" 
     }, 
     "date_detection": false, 
     "properties": { 
      "traits": { 
       "type": "object" 
      }, 
      "cl_utm_params": { 
       "type": "object" 
      }, 
      "cl_other_params": { 
       "type": "object" 
      }, 
      "cl_triggered_ts": { 
       "type": "date" 
      } 
     } 
    } 

A sample document:

{ 
     "client_id": "cl58vivh8w7t", 
     "user_id": "CL.1122029143.1904488380.1218174474.2049762488", 
     "session_id": "CL.1886305621.906039613", 
     "source": "Google", 
     "action": "pageview", 
     "cl_triggered_ts": "2016-09-09T00:13:33.818Z", 
     "browser": "Microsoft Edge 13", 
     "platform": "Windows 10", 
     "screen_size": "1920 x 1080", 
     "device": "Desktop", 
     "ip_address": "98.236.246.165", 
     "country": "United States", 
     "city": "Weirton", 
     "postal_code": "26062", 
     "location": "40.4224, -80.5739", 
     "timezone": "America/New_York", 
     "state": "West Virginia", 
     "continent": "North America", 
     "isp": "Comcast Cable", 
     "browser_language": "", 
     "traits": {}, 
     "cl_utm_params": {}, 
     "cl_other_params": {} 
    } 
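With the query below I am getting the number of unique sessions for each source, and the number of unique sessions for each device within each source, using bucket and metric aggregations: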

{ 
    "query": { 
    "bool": { 
     "must": [ 
      {"match": {"client_id": "cl58vivh8w7t"}} 
     ] 
    } 
    }, 
    "aggs": { 
    "top_source": { 
     "terms": { 
      "field": "source" 
     }, 
     "aggs": { 
      "total_unique_sessions": {"cardinality": {"field": "session_id"}}, 
      "per_device": { 
       "terms": {"field": "device"}, 
       "aggs": {"device_session": {"cardinality": {"field": "session_id"}}} 
       } 
      } 
     } 
    }, 
    "size": 0 
} 

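For reference, a single bucket from the response is shown below. Here, the sum of the per-device session values should equal the total_unique_sessions value.

Is something wrong with my query or with my calculation?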

{ 
     "key": "www.google.com", 
     "doc_count": 68947, 
     "per_device": { 
     "doc_count_error_upper_bound": 0, 
     "sum_other_doc_count": 0, 
     "buckets": [ 
      { 
      "key": "Desktop", 
      "doc_count": 49254, 
      "device_session": { 
       "value": 2413 
      } 
      }, 
      { 
      "key": "Mobile", 
      "doc_count": 16317, 
      "device_session": { 
       "value": 3222 
      } 
      }, 
      { 
      "key": "Tablet", 
      "doc_count": 3343, 
      "device_session": { 
       "value": 636 
      } 
      }, 
      { 
      "key": "TV", 
      "doc_count": 33, 
      "device_session": { 
       "value": 9 
      } 
      } 
     ] 
     }, 
     "total_unique_sessions": { 
     "value": 9058 
     } 
    } 
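
To make the mismatch concrete, the bucket above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the `bucket` dict is copied by hand from the response, and the `summarize` helper is a hypothetical name introduced for illustration.

```python
# One "top_source" bucket, copied from the response shown above.
bucket = {
    "key": "www.google.com",
    "doc_count": 68947,
    "per_device": {
        "buckets": [
            {"key": "Desktop", "doc_count": 49254, "device_session": {"value": 2413}},
            {"key": "Mobile", "doc_count": 16317, "device_session": {"value": 3222}},
            {"key": "Tablet", "doc_count": 3343, "device_session": {"value": 636}},
            {"key": "TV", "doc_count": 33, "device_session": {"value": 9}},
        ]
    },
    "total_unique_sessions": {"value": 9058},
}


def summarize(bucket):
    """Compare the per-device sub-bucket figures with the parent bucket (hypothetical helper)."""
    devices = bucket["per_device"]["buckets"]
    session_sum = sum(d["device_session"]["value"] for d in devices)
    total = bucket["total_unique_sessions"]["value"]
    return {
        "doc_count_sum": sum(d["doc_count"] for d in devices),
        "session_sum": session_sum,
        "total_unique_sessions": total,
        "gap": total - session_sum,
    }


summary = summarize(bucket)
print(summary)
```

The per-device doc_count values add up to the parent doc_count (68947), so no documents are missing from the device buckets. The per-device session counts add up to 6280, which is 2778 short of total_unique_sessions (9058). Every session in the source bucket falls into at least one device bucket, so exact per-device counts could not sum to less than the exact total; at least one of these cardinality values must therefore be an approximation error rather than a mistake in the arithmetic.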

Answer


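I see that you are using a match query.

Normally we use a term query together with aggregations. I think the match query is causing the problem.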

{ 
    "query": { 
    "bool": { 
     "must": [ 
      {"term": {"client_id": "cl58vivh8w7t"}} 
     ] 
    } 
    }, 
    "aggs": { 
    "top_source": { 
     "terms": { 
      "field": "source" 
     }, 
     "aggs": { 
      "total_unique_sessions": {"cardinality": {"field": "session_id"}}, 
      "per_device": { 
       "terms": {"field": "device"}, 
       "aggs": {"device_session": {"cardinality": {"field": "session_id"}}} 
       } 
      } 
     } 
    }, 
    "size": 0 
} 

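Yes, preferring term over match on analyzed fields for faster results is fine, but since client_id is not analyzed, it makes no difference in this query. Both return the same results. – shivg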
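
Since the question already notes that cardinality is approximate, one thing that might be worth trying is the cardinality aggregation's `precision_threshold` option, which trades memory for accuracy. This is an assumption on my part rather than something proposed in this thread, and it may not fully close the gap. The sketch below only builds the request body as a Python dict; the `build_query` helper is hypothetical, and 40000 is used because, as far as I know, it is the highest value the option supports.

```python
import json


def build_query(client_id, precision_threshold=40000):
    """Build the request body from the question, with precision_threshold added.

    Hypothetical helper for illustration; it does not send anything to Elasticsearch.
    """
    # Same cardinality aggregation at both levels, with the accuracy knob raised.
    unique_sessions = {
        "cardinality": {
            "field": "session_id",
            "precision_threshold": precision_threshold,
        }
    }
    return {
        "query": {"bool": {"must": [{"term": {"client_id": client_id}}]}},
        "aggs": {
            "top_source": {
                "terms": {"field": "source"},
                "aggs": {
                    "total_unique_sessions": unique_sessions,
                    "per_device": {
                        "terms": {"field": "device"},
                        "aggs": {"device_session": unique_sessions},
                    },
                },
            }
        },
        "size": 0,
    }


body = build_query("cl58vivh8w7t")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index. If the per-device sums still fall short of the total with the threshold raised, the cause is probably something other than the cardinality estimate.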