2012-11-04
2

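elasticsearch - return term frequency of a single field

I have been trying to use a facet to get the term frequencies of a field. My query returns only one hit, so I want the facet to return the terms with the highest frequency in a particular field.

My mapping: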

{ 
"mappings":{ 
    "document":{ 
     "properties":{ 
      "tags":{ 
       "type":"object", 
       "properties":{ 
        "title":{ 
         "fields":{ 
          "partial":{ 
           "search_analyzer":"main", 
           "index_analyzer":"partial", 
           "type":"string", 
           "index" : "analyzed" 
          },
          "title":{ 
           "type":"string", 
           "analyzer":"main", 
           "index" : "analyzed" 
          } 
         }, 
         "type":"multi_field" 
        } 
       } 
      } 
     } 
    } 
}, 

"settings":{ 
    "analysis":{ 
     "filter":{ 
      "name_ngrams":{ 
       "side":"front", 
       "max_gram":50, 
       "min_gram":2, 
       "type":"edgeNGram" 
      } 
     }, 

     "analyzer":{ 
      "main":{ 
       "filter": ["standard", "lowercase", "asciifolding"], 
       "type": "custom", 
       "tokenizer": "standard" 
      }, 
      "partial":{ 
       "filter":["standard","lowercase","asciifolding","name_ngrams"], 
       "type": "custom", 
       "tokenizer": "standard" 
      } 
     } 
    } 
} 

} 

Test data (POST, since no document ID is given and one is auto-generated):

curl -XPOST localhost:9200/testindex/document -d '{"tags": {"title": "people also kill people"}}' 

Query:

curl -XGET 'localhost:9200/testindex/document/_search?pretty=1' -d ' 
{ 
    "query": 
    { 
     "term": { "tags.title": "people" } 
    }, 
    "facets": { 
     "popular_tags": { "terms": {"field": "tags.title"}} 
    } 
}' 

This returns the following result:

"hits" : { 
    "total" : 1, 
    "max_score" : 0.99381393, 
    "hits" : [ { 
    "_index" : "testindex", 
    "_type" : "document", 
    "_id" : "uI5k0wggR9KAvG9o7S7L2g", 
    "_score" : 0.99381393, "_source" : {"tags": {"title": "people also kill people"}} 
} ] 
}, 
"facets" : { 
    "popular_tags" : { 
    "_type" : "terms", 
    "missing" : 0, 
    "total" : 3, 
    "other" : 0, 
    "terms" : [ { 
    "term" : "people", 
    "count" : 1   // I expect this to be 2 
    }, { 
    "term" : "kill", 
    "count" : 1 
    }, { 
    "term" : "also", 
    "count" : 1 
    } ] 
} 

}

The result above is not what I want. I want the count for "people" to be 2:

"hits" : { 
    "total" : 1, 
    "max_score" : 0.99381393, 
    "hits" : [ { 
    "_index" : "testindex", 
    "_type" : "document", 
    "_id" : "uI5k0wggR9KAvG9o7S7L2g", 
    "_score" : 0.99381393, "_source" : {"tags": {"title": "people also kill people"}} 
} ] 
}, 
"facets" : { 
"popular_tags" : { 
    "_type" : "terms", 
    "missing" : 0, 
    "total" : 3, 
    "other" : 0, 
    "terms" : [ { 
    "term" : "people", 
    "count" : 2    
    }, { 
    "term" : "kill", 
    "count" : 1 
    }, { 
    "term" : "also", 
    "count" : 1 
    } ] 
} 
} 

How can I achieve this? Are facets the wrong way to go?

+0

May I know whether my answer was helpful? – javanna

+0

Yes, it was really helpful – Kennedy

Answers

6

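A facet counts documents, not the occurrences of terms within them. You get 1 because only one document contains that term; how many times the term occurs in that document doesn't matter. I don't know of any way to return term frequencies out of the box, but a facet is not a good fit for this.
If you enable term vectors, that information can be stored in the index, but at the moment there is no way to read term vectors back from elasticsearch.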
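
To make the distinction concrete, here is a minimal Python sketch that runs outside elasticsearch. The `tokenize` helper is a hypothetical stand-in that only approximates the `main` analyzer from the question (lowercasing and word splitting, with no ASCII folding), and the function names are illustrative, not part of any elasticsearch API.

```python
import re
from collections import Counter


def tokenize(text):
    # Rough approximation of the "main" analyzer: lowercase, split into words.
    # ASCII folding is not modelled here.
    return re.findall(r"\w+", text.lower())


def facet_style_counts(docs):
    # What a terms facet reports: each term is counted once per document.
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def term_occurrences(docs):
    # What the question asks for: every occurrence of a term is counted.
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


docs = ["people also kill people"]
print(facet_style_counts(docs)["people"])  # 1
print(term_occurrences(docs)["people"])    # 2
```

With a single matching document, the facet-style count for any term can never exceed 1, which is why the facet in the question reports 1 for "people".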

+0

Is there a way to do this without using facets? – brycemcd

+3

Term vectors are exposed as of 1.0 (beta2 is available), but you do need to store term_vectors: http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-termvectors.html – javanna
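
Once term vectors are available, the per-term frequency can be read from the response on the client side. The sketch below assumes the response nests as `term_vectors` → field name → `terms` → term → `term_freq`, as described in the linked documentation; the `sample` dictionary is hand-written for illustration and is not real server output, and the helper names are hypothetical.

```python
def term_frequencies(response, field):
    # Extract {term: term_freq} for one field from a term vectors response.
    # Assumed shape: response["term_vectors"][field]["terms"][term]["term_freq"].
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


def most_frequent(response, field, n=10):
    # Sort by descending frequency, then alphabetically for stable output.
    freqs = term_frequencies(response, field)
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hand-written sample in the assumed response shape (not real output).
sample = {
    "term_vectors": {
        "tags.title": {
            "terms": {
                "also": {"term_freq": 1},
                "kill": {"term_freq": 1},
                "people": {"term_freq": 2},
            }
        }
    }
}

print(most_frequent(sample, "tags.title"))
# [('people', 2), ('also', 1), ('kill', 1)]
```

This only helps if the field was indexed with term vectors stored, as the comment above notes; a missing field simply yields an empty list here.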

0

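Unfortunately, term frequencies for a field are not available in Elastic. The GitHub project Index TermList uses Lucene's terms and computes the total number of occurrences across all documents; you can check it out and adapt it to your needs.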