How does the "_id" field affect search in Elasticsearch?

I'm having some trouble with Elasticsearch. I managed to build a reproducible example on my machine; the code is at the end of this post.

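I simply create 6 users, "Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands" and "Sand Roger", and index them by name.

Then I run a query that matches "Roger Sand" and print the relevance scores.

Below are two runs of the same script with two different sets of IDs: 84046 to 84051 and 84047 to 84052 (just shifted by 1).

The results do not come back in the same order, and the scores differ:

Run with 84046..84051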

Sand Roger => 0.8838835 
Roger Sand => 0.2712221 
Cindy Sand => 0.22097087 
Jean-Roger Sands => 0.17677669 
Roger Gilbert => 0.028130025

Run with 84047..84052

Roger Sand => 0.2712221 
Sand Roger => 0.2712221 
Cindy Sand => 0.22097087 
Jean-Roger Sands => 0.17677669 
Roger Gilbert => 0.15891947

My question is: why does the "_id" have any effect on a search by "full_name"?

Here is the complete Ruby code for a reproducible script.

require 'elasticsearch'

first_id = 84046 # Or 84047
client = Elasticsearch::Client.new(:log => true)
client.transport.reload_connections!

# Start from a fresh index with default settings
client.indices.delete({ :index => 'test' })
client.indices.create({ :index => 'test' })
client.perform_request('POST', 'test/_refresh')

# Index the six users with consecutive IDs starting at first_id
["Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands", "Sand Roger"].each_with_index do |name, i|
    i2 = first_id + i
    client.create({
        :index => 'test', :type => 'user',
        :id => i2,
        :body => { :full_name => name }
    })
end

query_options = {
    :type => 'user', :index => 'test',
    :body => {
        :query => { :match => { :full_name => "Roger Sand" } }
    }
}

# Make the newly indexed documents visible to search
client.perform_request('POST', 'test/_refresh')

# Print each hit with its score
client.search(query_options)["hits"]["hits"].each do |hit|
    $stderr.puts "#{hit["_source"]["full_name"]} => #{hit["_score"]}"
end

And here is the command-line equivalent:

curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Roger Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Cindy Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Jean-Roger Sands"}' 
curl -XPUT 'http://localhost:9200/test/user/84052?op_type=create' -d '{"full_name":"Sand Roger"}' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}' 


curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84046?op_type=create' -d '{"full_name":"Roger Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Cindy Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Jean-Roger Sands"}' 
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Sand Roger"}' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'

2014-01-29 pierallard

The problem lies in distributed score computation.

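You created a new index with the default settings, which means 5 shards. Each shard is its own Lucene index. When you index a document, Elasticsearch has to decide which shard it goes to, and it does so by hashing the _id (when no routing parameter is given).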

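So by shifting the IDs, you end up distributing the documents to different shards. As stated above, each shard is its own Lucene index, and when you search across several shards, the scores computed on each shard have to be combined. Because the routing differs, each shard holds different documents and therefore different term statistics, so the individual scores differ.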
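The routing step can be sketched in a few lines of Ruby. This is only an illustration of "hash the _id, then take it modulo the number of shards": Zlib.crc32 is a stand-in hash chosen for convenience, not the function Elasticsearch uses internally, and shard_for is a hypothetical helper, so the shard numbers printed here will not match a real cluster.

```ruby
require 'zlib'

# Illustrative only: Zlib.crc32 is a stand-in hash, NOT Elasticsearch's
# internal routing hash, so these shard numbers are not the real ones.
NUMBER_OF_SHARDS = 5 # the default shard count mentioned above

# Hypothetical helper: map a document id to a shard by hashing it.
def shard_for(id, shards = NUMBER_OF_SHARDS)
  Zlib.crc32(id.to_s) % shards
end

# Compare where the six documents land for the two starting IDs.
[84046, 84047].each do |first_id|
  placement = (0...6).map { |i| shard_for(first_id + i) }
  puts "ids #{first_id}..#{first_id + 5} => shards #{placement.inspect}"
end
```

The point is only that the name "Sand Roger" is attached to a different ID in each run, so it can end up sharing a shard with different neighbours.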

You can verify this by adding explain to your query. For Sand Roger, the idf is computed as idf(docFreq=1, maxDocs=1) = 0.30685282 in one run and idf(docFreq=1, maxDocs=2) = 1 in the other, which leads to different scores.
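The two idf values can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity, where idf = 1 + ln(maxDocs / (docFreq + 1)); the function name idf is just a local helper for illustration.

```ruby
# Classic Lucene TF-IDF inverse document frequency (assumed formula):
#   idf = 1 + ln(maxDocs / (docFreq + 1))
def idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

puts idf(1, 1) # term seen in 1 doc, shard holds 1 doc
puts idf(1, 2) # term seen in 1 doc, shard holds 2 docs
```

The first call gives roughly 0.3069 and the second gives 1.0, matching the explain output: the same term in the same document is weighted differently depending only on how many other documents happen to share its shard.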

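You can either change the number of shards to 1 or change the search type to one of the dfs types. Searching against http://localhost:9200/test/user/_search?pretty&search_type=dfs_query_and_fetch will give you correct scores, because it adds an initial scatter phase that computes the distributed term frequencies for more accurate scoring.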

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-and-fetch
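Applied to the Ruby script from the question, the two fixes might look roughly like this. It is a sketch that only builds the option hashes; create_options and search_options are names introduced here, and the calls that would send them are left as comments because they need a running cluster.

```ruby
# Option 1: create the test index with a single shard, so every
# document is scored against the same term statistics.
create_options = {
  :index => 'test',
  :body  => { :settings => { :number_of_shards => 1 } }
}

# Option 2: keep the default shards but request a DFS search type.
# 'dfs_query_and_fetch' is the value named in the answer;
# 'dfs_query_then_fetch' is the other DFS variant on the linked page.
search_options = {
  :index => 'test', :type => 'user',
  :search_type => 'dfs_query_and_fetch',
  :body => { :query => { :match => { :full_name => 'Roger Sand' } } }
}

# With a client set up as in the question (not executed here):
#   client.indices.create(create_options)
#   client.search(search_options)
puts create_options.inspect
puts search_options.inspect
```

Only one of the two is needed. The single-shard index is the simpler choice for a small test, while the DFS search type keeps the index layout unchanged at the cost of an extra round trip per query.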

2014-01-29 10:04:44 knutwalker

Changing the search type solved my problem. Thanks! – pierallard

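Always be wary of scoring when you combine a small data set with the default Elasticsearch index setting of 5 shards.

For tests like this one, use an index with a single shard, or use a much larger data set so that terms are distributed more evenly across the shards.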

2014-01-29 09:59:56 karmi
