Why do two identical documents get different scores?

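I am currently getting to grips with the tire gem (I am also new to elasticsearch and lucene) and trying a few things out. I need to do some (probably non-trivial) scoring, so I am trying to get a handle on that. I read everything I could find online about the scoring formula and tried to match what I found against the output of explained queries.

If I am reading the numbers correctly, the two documents titled "foo foo foo foo" get different scores, which is surely not expected. I guess I missed a step during or after indexing, but I could not figure out which.

Below is my code. I am not following the tire DSL entirely as intended, because I want to get to the bottom of things - it may look more tire-ish at some point.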

require 'tire'
require 'pp'

class Model
  INDEX = 'myindex'
  TYPE = 'company'

  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end

    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end

    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end

  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) # use oid as id, just for testing
  end

  def _type
    TYPE
  end

  def id
    object_id.to_s # convert to string because tire compares to object_id!
  end

  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end

  def to_indexed_json
    @attributes.to_json
  end

  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]

  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }

  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        # bypass the DSL and inject the raw query hash
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end

class Tire::Results::Collection
  # collect id, explanation and title of every hit
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end

Model.delete_index
Model.create_mapping
Model::ENTITIES.each(&:index)
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained

The printed result looks like this:

[{"_id"=>"2169251840", 
    "_explanation"=> 
    {"value"=>0.54932046, 
    "description"=>"fieldWeight(_all:foo in 0), product of:", 
    "details"=> 
    [{"value"=>1.4142135, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]}, 
    "title"=>"foo foo foo foo"}, 
{"_id"=>"2169251720", 
    "_explanation"=> 
    {"value"=>0.54932046, 
    "description"=>"fieldWeight(_all:foo in 1), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>1.0, "description"=>"fieldNorm(field=_all, doc=1)"}]}, 
    "title"=>"foo"}, 
{"_id"=>"2169250520", 
    "_explanation"=> 
    {"value"=>0.48553526, 
    "description"=>"fieldWeight(_all:foo in 2), product of:", 
    "details"=> 
    [{"value"=>1.0, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.0, "description"=>"tf(phraseFreq=1.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=2)"}]}, 
    "title"=>"foo foo"}, 
{"_id"=>"2169251320", 
    "_explanation"=> 
    {"value"=>0.44194174, 
    "description"=>"fieldWeight(_all:foo in 1), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>1.0, "description"=>"idf(_all: foo=1)"}, 
     {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=1)"}]}, 
    "title"=>"foo bar"}, 
{"_id"=>"2169250380", 
    "_explanation"=> 
    {"value"=>0.27466023, 
    "description"=>"fieldWeight(_all:foo in 3), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=3)"}]}, 
    "title"=>"foo bar baz"}, 
{"_id"=>"2169250660", 
    "_explanation"=> 
    {"value"=>0.2169777, 
    "description"=>"fieldWeight(_all:foo in 0), product of:", 
    "details"=> 
    [{"value"=>1.4142135, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.30685282, "description"=>"idf(_all: foo=1)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]}, 
    "title"=>"foo foo foo foo"}]

Am I reading the numbers wrong? Or misusing Tire? Or am I simply missing some "reindex the whole collection" step?

Source

2012-06-27 schnittchen

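I switched on logging and extracted the transcript as a series of curl calls, then replayed it. I see a difference between long ids, as in curl -X PUT 'http://localhost:9200/myindex/company/2229231160' -d '{"title": "foo foo foo", "_id": 2229231160}', and short ids, as in curl -X PUT 'http://localhost:9200/myindex/company/6' -d '{"title": "foo foo foo", "_id": 6}'. Looks like a bug to me. – schnittchen

Which version of elasticsearch are you using? – concept47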

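afaik, if no explicit sort field is defined, results are sorted by default on a variant of tf*idf (http://en.wikipedia.org/wiki/Tf*idf).

Literally: term frequency * inverse document frequency.

Wikipedia:

Term frequency (term count): the term count in a given document is simply the number of times a given term appears in that document.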

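The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

In this case, the "term frequency" component is most likely what makes "foo foo foo foo" score higher than the other documents when searching for 'foo'.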
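As a rough check of how these components combine, the sketch below multiplies the three factors that the explain output in the question lists for each hit (tf, idf and fieldNorm). The factor values are copied from that output; the method name field_weight is made up for this illustration and is not part of Tire or elasticsearch.

```ruby
# Sketch: recombine the factors that explain prints for a single hit.
# field_weight is a hypothetical helper name, not a Tire/elasticsearch API.
def field_weight(tf:, idf:, field_norm:)
  tf * idf * field_norm
end

# The explain output reports tf as the square root of the frequency,
# e.g. tf(phraseFreq=2.0) => 1.4142135.
tf = Math.sqrt(2.0)

# Both "foo foo foo foo" hits share tf and fieldNorm; only idf differs.
first_copy  = field_weight(tf: tf, idf: 0.7768564,  field_norm: 0.5)
second_copy = field_weight(tf: tf, idf: 0.30685282, field_norm: 0.5)

puts format('first copy:  %.4f', first_copy)
puts format('second copy: %.4f', second_copy)
```

The two products agree with the reported scores 0.54932046 and 0.2169777 up to rounding, so the whole gap between the identical titles comes from the idf factor, not from term frequency.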

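Also, regarding the effect you see when changing the id: I am not sure, but my guess is that it has to do with ES storing documents internally ordered by id.

If that is the case, 2 documents with the same sort score would be ordered by id as the tiebreaker. You can of course define multiple sorts to change this behaviour (e.g. sort=sorta+desc,sortb+desc; in that case sortb is used as the tiebreaker for all documents that have the same score on sorta).

Source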
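The tiebreaker idea can be sketched in plain Ruby. This only illustrates multi-key ordering; it is not how elasticsearch implements sorting, and the Hit struct, ids and scores are invented for the example.

```ruby
# Hypothetical hits: two of them share the same score.
Hit = Struct.new(:id, :score)

hits = [
  Hit.new('b', 0.5),
  Hit.new('a', 0.5),
  Hit.new('c', 0.9)
]

# Primary key: score descending. Secondary key: id ascending,
# which only matters when two scores are equal.
ordered = hits.sort_by { |hit| [-hit.score, hit.id] }

puts ordered.map(&:id).inspect
```

Here 'c' comes first on score, and 'a' precedes 'b' only because of the secondary key.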

2012-07-09 18:40:06

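Hmm, I think I misread your post, since you are talking about the 2 posts titled "foo foo foo foo" getting different scores? If so, I don't see where the score difference comes from –

'tf * idf' is correct, but it is worth noting that unless you use [dfs queries](http://www.elasticsearch.org/guide/reference/api/search/search-type.html), this is the document frequency on the local shard, not across the whole index... – Basic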
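The idf values in the question's explain output are consistent with this. The sketch below uses Lucene's classic idf formula, 1 + ln(numDocs / (docFreq + 1)). The docFreq values are taken from the output (foo=4 and foo=1); the per-shard document counts are assumptions, chosen because they reproduce the printed numbers.

```ruby
# Lucene's classic (DefaultSimilarity) idf: 1 + ln(num_docs / (doc_freq + 1)).
def idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

# doc_freq comes from the explain output; num_docs per shard is an assumption.
shard_a = idf(4, 4) # matches idf(_all: foo=4) => 0.7768564
shard_b = idf(1, 1) # matches idf(_all: foo=1) => 0.30685282
shard_c = idf(1, 2) # matches idf(_all: foo=1) => 1.0

# Index-wide view: 6 of the 8 indexed titles contain "foo".
global = idf(6, 8)

puts shard_a, shard_b, shard_c, global
```

Under these assumptions the two "foo foo foo foo" documents sit on different shards holding 4 documents and 1 document respectively, so each is scored with its own shard's idf. With index-wide frequencies both copies would share a single idf value and therefore the same score.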
