2012-06-27 77 views
2

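I'm currently getting to grips with the tire gem (I'm also new to elasticsearch and lucene) and trying a few things out. I need to do some (probably non-trivial) scoring, so I'm trying to get a handle on it. I've read everything I could find online about the scoring formula and tried to match it against the explained queries. Why do two identical documents score differently?

If I'm reading the numbers correctly, the two documents titled "foo foo foo foo" get different scores, which is certainly not expected. I suspect I'm missing a step during or after indexing, but I can't figure out which.

Here is my code. I'm not using the tire DSL quite as intended, because I want to get to the bottom of things; it may well look more tire-ish at some point.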

require 'tire'
require 'pp'

class Model
  INDEX = 'myindex'
  TYPE = 'company'

  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end

    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end

    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end

  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) # use oid as id, just for testing
  end

  def _type
    TYPE
  end

  def id
    object_id.to_s # convert to string because tire compares to object_id!
  end

  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end

  def to_indexed_json
    @attributes.to_json
  end

  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]

  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }

  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end

class Tire::Results::Collection
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end

Model.delete_index
Model.create_mapping
Model::ENTITIES.each &:index
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained

The printed result looks like this:

[{"_id"=>"2169251840", 
    "_explanation"=> 
    {"value"=>0.54932046, 
    "description"=>"fieldWeight(_all:foo in 0), product of:", 
    "details"=> 
    [{"value"=>1.4142135, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]}, 
    "title"=>"foo foo foo foo"}, 
{"_id"=>"2169251720", 
    "_explanation"=> 
    {"value"=>0.54932046, 
    "description"=>"fieldWeight(_all:foo in 1), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>1.0, "description"=>"fieldNorm(field=_all, doc=1)"}]}, 
    "title"=>"foo"}, 
{"_id"=>"2169250520", 
    "_explanation"=> 
    {"value"=>0.48553526, 
    "description"=>"fieldWeight(_all:foo in 2), product of:", 
    "details"=> 
    [{"value"=>1.0, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.0, "description"=>"tf(phraseFreq=1.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=2)"}]}, 
    "title"=>"foo foo"}, 
{"_id"=>"2169251320", 
    "_explanation"=> 
    {"value"=>0.44194174, 
    "description"=>"fieldWeight(_all:foo in 1), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>1.0, "description"=>"idf(_all: foo=1)"}, 
     {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=1)"}]}, 
    "title"=>"foo bar"}, 
{"_id"=>"2169250380", 
    "_explanation"=> 
    {"value"=>0.27466023, 
    "description"=>"fieldWeight(_all:foo in 3), product of:", 
    "details"=> 
    [{"value"=>0.70710677, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.7768564, "description"=>"idf(_all: foo=4)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=3)"}]}, 
    "title"=>"foo bar baz"}, 
{"_id"=>"2169250660", 
    "_explanation"=> 
    {"value"=>0.2169777, 
    "description"=>"fieldWeight(_all:foo in 0), product of:", 
    "details"=> 
    [{"value"=>1.4142135, 
     "description"=>"btq, product of:", 
     "details"=> 
     [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"}, 
     {"value"=>1.0, "description"=>"allPayload(...)"}]}, 
     {"value"=>0.30685282, "description"=>"idf(_all: foo=1)"}, 
     {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]}, 
    "title"=>"foo foo foo foo"}] 

Am I reading the numbers wrong? Or misusing Tire? Or am I just missing some "reindex the whole collection" step?
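One way to check the reading of these numbers: each top-level value in the explanation is described as a "product of" its three details (the tf value wrapped in `btq`, the idf, and the fieldNorm). The sketch below copies those factors from the output above and multiplies them back together. The constant `EXPLAINED_HITS` and the helper `recomputed` are names made up for this illustration; nothing here talks to elasticsearch.

```ruby
# Factors copied from the explain output above:
# [title, tf (btq), idf, fieldNorm, reported score]
EXPLAINED_HITS = [
  ["foo foo foo foo", 1.4142135,  0.7768564,  0.5,   0.54932046],
  ["foo",             0.70710677, 0.7768564,  1.0,   0.54932046],
  ["foo foo",         1.0,        0.7768564,  0.625, 0.48553526],
  ["foo bar",         0.70710677, 1.0,        0.625, 0.44194174],
  ["foo bar baz",     0.70710677, 0.7768564,  0.5,   0.27466023],
  ["foo foo foo foo", 1.4142135,  0.30685282, 0.5,   0.2169777]
]

# Multiply the three factors of each hit and keep the reported score for comparison.
def recomputed(hits)
  hits.map do |title, tf, idf, norm, reported|
    { title: title, idf: idf, product: tf * idf * norm, reported: reported }
  end
end

recomputed(EXPLAINED_HITS).each do |row|
  printf("%-17s idf=%.8f product=%.8f reported=%.8f\n",
         row[:title], row[:idf], row[:product], row[:reported])
end
```

Read this way, the two "foo foo foo foo" hits share the same tf and fieldNorm and differ only in the idf factor (`idf(_all: foo=4)` versus `idf(_all: foo=1)`), so the numbers are being read correctly and the difference sits in the document-frequency statistics.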

+0

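I switched on logging and extracted the transcript as a series of curl calls. Replayed, tweaked. The behaviour changes when I go from long ids, as in `curl -X PUT 'http://localhost:9200/myindex/company/2229231160' -d '{"title":"foo foo foo","_id":2229231160}'`, to short ones, as in `curl -X PUT "http://localhost:9200/myindex/company/6" -d '{"title":"foo foo foo","_id":6}'`. Looks like a bug to me. – schnittchen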

+0

Which version of elasticsearch are you using? – concept47

Answers

2

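AFAIK, if no explicit sort field is defined, the default is to sort by score, which is a variant of tf*idf (http://en.wikipedia.org/wiki/Tf*idf).

Literally: term frequency * inverse document frequency.

Wikipedia:

Term frequency (term count): the term count in a given document is simply the number of times a given term appears in that document.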

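The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

In this case, the "term frequency" component is most likely what makes "foo foo foo foo" score higher than the other documents when searching for 'foo'.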
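The two quoted definitions can be tried out on the titles from the question. This is a minimal sketch of the textbook formulas only; Lucene's actual scoring uses its own variants of both factors. `TITLES`, `term_count` and `inverse_document_frequency` are names invented for the illustration.

```ruby
# Titles of the eight documents indexed in the question.
TITLES = [
  "foo foo foo foo", "foo", "bar", "foo bar",
  "xxx", "foo foo foo foo", "foo foo", "foo bar baz"
]

# Term frequency (term count): how often the term occurs in one document.
def term_count(term, text)
  text.split.count(term)
end

# Inverse document frequency: log of (total documents / documents containing the term).
# A term that occurs in no document yields Infinity here; a real implementation guards against that.
def inverse_document_frequency(term, docs)
  containing = docs.count { |doc| doc.split.include?(term) }
  Math.log(docs.size.to_f / containing)
end

puts term_count("foo", "foo foo foo foo")           # occurrences of "foo" in one title
puts inverse_document_frequency("foo", TITLES)      # common term, low idf
puts inverse_document_frequency("xxx", TITLES)      # rare term, higher idf
```

"foo" occurs in six of the eight titles, so its idf is low, while "xxx" occurs in one and gets a higher value.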

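Also, regarding the effect you see when changing the id: I'm not sure, but my guess is that it has to do with ES storing documents internally ordered by id (again, I'm not sure)...

If that is the case, two documents with the same sort score would be ordered by id as a tiebreaker. You can of course define multiple sorts to change this behaviour (e.g. sort=sorta+desc,sortb+desc; in that case sortb is used as the tiebreaker for all documents that score the same on sorta).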
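The "primary sort plus tiebreaker" idea can be illustrated in plain Ruby. This is only a sketch of the ordering logic, with a hypothetical `RankedHit` struct and `rank` helper; it says nothing about how elasticsearch orders ties internally, which remains a guess above.

```ruby
# Hypothetical hit record: an id and a score.
RankedHit = Struct.new(:id, :score)

# Sort by score descending; hits with equal scores fall back to ascending id.
def rank(hits)
  hits.sort_by { |hit| [-hit.score, hit.id] }
end

hits = [RankedHit.new(6, 0.5), RankedHit.new(2, 0.9), RankedHit.new(3, 0.5)]
rank(hits).each { |hit| puts "#{hit.id} #{hit.score}" }
```

The two hits with score 0.5 come out in id order, which is the role the second sort key plays in a multi-key sort.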

+1

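Hmm, I think I misread your post, since you are talking about two posts titled "foo foo foo foo" getting different scores? If that's the case, I wouldn't know where the score difference comes from –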

+0

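'tf*idf' is correct, but it's worth noting that unless you use [dfs queries](http://www.elasticsearch.org/guide/reference/api/search/search-type.html), it is the document frequency on the local shard, not across the index as a whole... – Basic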
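The per-shard effect can be checked against the idf values in the question's explain output. The sketch below assumes Lucene's classic `DefaultSimilarity` formula, idf = 1 + ln(numDocs / (docFreq + 1)). The shard sizes passed in are an inference that happens to reproduce the printed values; the explain output itself only shows the document frequency (`foo=4`, `foo=1`), not the shard size. `lucene_idf` is a name made up for this illustration.

```ruby
# Classic Lucene DefaultSimilarity idf (assumed here): 1 + ln(numDocs / (docFreq + 1)).
def lucene_idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

# Inferred shard layouts: (documents containing "foo", documents in the shard).
puts lucene_idf(4, 4) # ≈ 0.7768564, matches idf(_all: foo=4)
puts lucene_idf(1, 1) # ≈ 0.3068528, matches the low-scoring "foo foo foo foo" hit
puts lucene_idf(1, 2) # 1.0, matches the "foo bar" hit

# With index-wide statistics (6 of 8 documents contain "foo"),
# every hit would share one idf value.
puts lucene_idf(6, 8)
```

Under these assumptions the two identical documents simply landed on shards with different local statistics. The `search_type=dfs_query_then_fetch` search type described on the linked page gathers index-wide frequencies before scoring; for a small test like this one, an index with a single shard should have the same effect.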
