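Adjusting Lucene search result scores by weighting a specific field among fields with the same name
I currently use Lucene as our full-text search engine, but we need to rank the search results according to a specific field.
For example, suppose our index contains the following three documents, whose contents are identical except for the id field.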
val document01 = new Document()
val field0100 = new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED)
val field0101 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0102 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document01.add(field0100)
document01.add(field0101)
document01.add(field0102)
val document02 = new Document()
val field0200 = new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED)
val field0201 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0202 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document02.add(field0200)
document02.add(field0201)
document02.add(field0202)
val document03 = new Document()
val field0300 = new Field("id", "3", Field.Store.YES, Field.Index.ANALYZED)
val field0301 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0302 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document03.add(field0300)
document03.add(field0301)
document03.add(field0302)
Now, when I search for Linux using IndexSearcher, I get the following results:
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
When I search for Windows, I get the same results in the same order.
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
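My question is: can a specific field be weighted when the index is built? For example, I want field0201 to score higher whenever it matches the search.
In other words, when I search for Linux, I want the results in the following order: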
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
When I search for Windows, the results should keep the original order, as shown below:
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
I tried field0201.setBoost(), but it changes the order of the results both when I search for Linux and when I search for Windows.
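A likely explanation, assuming Lucene 3.x behaviour: index-time boosts of all fields that share a name within one document are multiplied into a single per-document value for that field name, so a boost on field0201 cannot be told apart from a boost on field0202. The toy model below is not Lucene code; FieldInstance, combinedBoost and the boost value 2.0f are hypothetical names and numbers used only to illustrate that assumption.

```scala
object BoostModel {
  // Toy stand-in for a Lucene field instance (hypothetical, not the Lucene API).
  final case class FieldInstance(name: String, text: String, boost: Float = 1.0f)

  // Assumed rule: one combined boost per (document, field name),
  // computed as the product of the boosts of every instance with that name.
  def combinedBoost(fields: Seq[FieldInstance], name: String): Float =
    fields.filter(_.name == name).map(_.boost).product

  // Mirrors document02, with a boost set on the "Linux" instance only.
  val doc02: Seq[FieldInstance] = Seq(
    FieldInstance("id", "2"),
    FieldInstance("contents", "This is a test: Linux", boost = 2.0f),
    FieldInstance("contents", "This is a test: Windows")
  )

  // Both queries hit the field name "contents", so both see the same factor.
  val boostSeenByLinuxQuery: Float = combinedBoost(doc02, "contents")
  val boostSeenByWindowsQuery: Float = combinedBoost(doc02, "contents")
}
```

Under this model the boost is attached to the field name rather than to one instance, which would match the behaviour I am seeing for both searches.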
It looks like the documents all contain the same data except for the ID. Why would you expect the scores to differ? – huynhjl 2011-04-14 02:38:11
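@huynhjil Because the contents come from different sources. If a field from a particular source matches the search term, I want it to score higher. In other words, the results should be sorted by the pair (score computed by Lucene, field source). – 2011-04-14 02:44:22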
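Could you sort using a Sort instance passed to TopFieldCollector? ... Or do you explicitly want to do this through the score of your fields (which would only work if the contents are not identical)? – csupnig 2011-04-14 07:03:43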
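The (Lucene score, field source) ordering mentioned in the comments could be sketched roughly as follows. This is plain Scala applied to hits after the search, not Lucene API code; Hit, fromPreferredSource and the score value 0.5f are hypothetical, and it assumes the application can tell, per query, whether the matching field came from the preferred source.

```scala
object SourceAwareRanking {
  // Hypothetical hit record: Lucene's score plus a flag saying whether the
  // field that matched the query came from the preferred source.
  final case class Hit(id: String, score: Float, fromPreferredSource: Boolean)

  // Higher score first; on equal scores, preferred-source hits first.
  // sortWith is stable, so remaining ties keep the original order.
  def rank(hits: Seq[Hit]): Seq[Hit] =
    hits.sortWith { (a, b) =>
      if (a.score != b.score) a.score > b.score
      else a.fromPreferredSource && !b.fromPreferredSource
    }

  // Searching "Linux": only document 2 matched through its preferred field.
  val linuxHits: Seq[Hit] = Seq(
    Hit("1", 0.5f, fromPreferredSource = false),
    Hit("2", 0.5f, fromPreferredSource = true),
    Hit("3", 0.5f, fromPreferredSource = false)
  )

  // Searching "Windows": no match came from a preferred field.
  val windowsHits: Seq[Hit] = Seq(
    Hit("1", 0.5f, fromPreferredSource = false),
    Hit("2", 0.5f, fromPreferredSource = false),
    Hit("3", 0.5f, fromPreferredSource = false)
  )
}
```

With these inputs, rank(linuxHits) puts document 2 first, while rank(windowsHits) leaves the original order untouched, which is the behaviour asked for in the question.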