2013-10-07

Lucene: exact matches are not shown first

I am using the demo IndexFiles and SearchFiles classes from the org.apache.lucene.demo package to index and search.

My problem is that when I use a query containing several words, the results with an exact match do not come first. For example:

Enter query: 
"natural language" 
Searching for: "natural language" 
298 total matching documents 
1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing 
.txt 
2. download\researchers.uq.edu.au\research-project\16267.txt 
3. download\researchers.uq.edu.au\research-project\16279.txt 
4. download\researchers.uq.edu.au\research-project\18361.txt 
5. download\www.uq.edu.au\news\%3Farticle%3D2187.txt 
6. download\researchers.uq.edu.au\researcher\2115.txt 
7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project 
s-dr-alan-cody%3Fpage%3D1.txt 
8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project 
s-dr-alan-cody%3Fpage%3D2.txt 
9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project 
s-dr-alan-cody.txt 
10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-pr 
ojects-dr-alan-cody.txt 
Press (n)ext page, (q)uit or enter number to jump to a page. 

does not return the same results as:

Enter query: 
natural language 
Searching for: natural language 
54307 total matching documents 
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt 

2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D576.txt 

3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D46.txt 
4. download\espace.library.uq.edu.au\view\UQ%3A166163.txt 
5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D108.txt 

6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D70.txt 
7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D708.txt 

8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing 
.txt 
9. download\researchers.uq.edu.au\research-project\16267.txt 
10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D117.tx 
t 
Press (n)ext page, (q)uit or enter number to jump to a page. 

For example, the first matched file does not even contain the keyword "language".

If I use the explain() method of the IndexSearcher class, I get this for the first result:

1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt 
0.70643383 = (MATCH) sum of: 
    0.5590494 = (MATCH) weight(contents:natural in 62541) [DefaultSimilarity], result of: 
    0.5590494 = score(doc=62541,freq=4.0 = termFreq=4.0 
), product of: 
     0.8091749 = queryWeight, product of: 
     4.4216847 = idf(docFreq=13111, maxDocs=401502) 
     0.18300149 = queryNorm 
     0.6908882 = fieldWeight in 62541, product of: 
     2.0 = tf(freq=4.0), with freq of: 
      4.0 = termFreq=4.0 
     4.4216847 = idf(docFreq=13111, maxDocs=401502) 
     0.078125 = fieldNorm(doc=62541) 
    0.1473844 = (MATCH) weight(contents:language in 62541) [DefaultSimilarity], result of: 
    0.1473844 = score(doc=62541,freq=1.0 = termFreq=1.0 
), product of: 
     0.5875679 = queryWeight, product of: 
     3.2107275 = idf(docFreq=44012, maxDocs=401502) 
     0.18300149 = queryNorm 
     0.25083807 = fieldWeight in 62541, product of: 
     1.0 = tf(freq=1.0), with freq of: 
      1.0 = termFreq=1.0 
     3.2107275 = idf(docFreq=44012, maxDocs=401502) 
     0.078125 = fieldNorm(doc=62541) 

If I go to the next page, I find an entry like this:

19. download\www.uq.edu.au\news\%3Farticle%3D2187.txt 
0.47449595 = (MATCH) sum of: 
    0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of: 
    0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0 
), product of: 
     0.8091749 = queryWeight, product of: 
     4.4216847 = idf(docFreq=13111, maxDocs=401502) 
     0.18300149 = queryNorm 
     0.3454441 = fieldWeight in 35173, product of: 
     2.0 = tf(freq=4.0), with freq of: 
      4.0 = termFreq=4.0 
     4.4216847 = idf(docFreq=13111, maxDocs=401502) 
     0.0390625 = fieldNorm(doc=35173) 
    0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of: 
    0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0 
), product of: 
     0.5875679 = queryWeight, product of: 
     3.2107275 = idf(docFreq=44012, maxDocs=401502) 
     0.18300149 = queryNorm 
     0.33182758 = fieldWeight in 35173, product of: 
     2.6457512 = tf(freq=7.0), with freq of: 
      7.0 = termFreq=7.0 
     3.2107275 = idf(docFreq=44012, maxDocs=401502) 
     0.0390625 = fieldNorm(doc=35173) 

That page does contain the exact phrase "natural language". So my questions are:

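1) Why doesn't Lucene show exact matches first?

2) Why does Lucene show results that do not even contain the keywords?

3) Where/how can I change this so that exact matches are shown first, followed by the more relevant ones?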

Answer

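1 - It isn't intended to. See the documentation on Lucene query syntax. The query natural language is a query made up of two terms. To Lucene, there is no inherent preference for those terms appearing together. If you want to find exact matches, a phrase query is the right way to do it, like "natural language"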

2 - The result whose explain output you included does match both terms. See:

0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of: 
    0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0 
... 
0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of: 
    0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0 

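According to Lucene, it found the term "natural" 4 times in the document and the term "language" 7 times, in the contents field (which I assume is the default field).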
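
To see why document 62541 still outranks document 35173, the arithmetic in the two explain trees can be reproduced by hand. The following is only a sketch: the formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), term score = queryWeight * fieldWeight) are read off the explain output above, and the class and method names are made up for illustration.

```java
// Reproduces the per-term arithmetic visible in the explain() output above.
// Formulas are inferred from the explain tree; names here are hypothetical.
final class ExplainArithmetic {

    // tf as shown in the output: sqrt(4.0) = 2.0, sqrt(7.0) = 2.6457512
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // idf consistent with the output: 1 + ln(maxDocs / (docFreq + 1))
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // One term's contribution: queryWeight * fieldWeight
    static double termScore(double freq, long docFreq, long maxDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;             // e.g. 0.8091749 for "natural"
        double fieldWeight = tf(freq) * idf * fieldNorm;  // e.g. 0.6908882 in doc 62541
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        long maxDocs = 401502L;
        double queryNorm = 0.18300149;

        // doc 62541: natural x4, language x1, fieldNorm 0.078125
        double doc62541 = termScore(4, 13111, maxDocs, queryNorm, 0.078125)
                        + termScore(1, 44012, maxDocs, queryNorm, 0.078125);

        // doc 35173: natural x4, language x7, fieldNorm 0.0390625
        double doc35173 = termScore(4, 13111, maxDocs, queryNorm, 0.0390625)
                        + termScore(7, 44012, maxDocs, queryNorm, 0.0390625);

        System.out.printf("doc 62541: %.4f%n", doc62541);
        System.out.printf("doc 35173: %.4f%n", doc35173);
    }
}
```

Up to float rounding, the two sums should agree with the 0.70643383 and 0.47449595 reported by explain(). Both documents contain "natural" four times, so the gap comes mostly from fieldNorm (0.078125 versus 0.0390625), which in DefaultSimilarity largely reflects field length. No factor in this sum looks at whether the two terms are adjacent.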

3 - Look through the query parser syntax to see what makes the most sense for you. It sounds like you might find Proximity Searches useful.

If you simply want phrase matches first, followed by the others, you could use something along the lines of:

"natural language" natural language 
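
A query string of this shape can be generated from whatever the user types. The helper below is a minimal sketch, not part of the Lucene demo: the class name, the method name and the slop parameter are assumptions introduced here. It relies only on the classic query parser syntax, where "..." is a phrase, "..."~N is a phrase with slop N, and space-separated clauses are optional by default. It strips double quotes but does not escape other special characters; QueryParser.escape would be the usual tool for that.

```java
// Builds: "<all words as a phrase>"[~slop] <word1> <word2> ...
// Documents matching the phrase clause pick up an extra score contribution,
// while documents matching only single terms are still returned.
final class PhraseFirstQuery {

    static String build(String userInput, int slop) {
        // Drop quotes so the input cannot break out of the phrase clause.
        String cleaned = userInput.replace("\"", " ").trim();
        if (cleaned.isEmpty()) {
            return "";
        }
        String[] words = cleaned.split("\\s+");
        if (words.length < 2) {
            return words[0]; // a single word has no phrase to add
        }
        String terms = String.join(" ", words);
        String phrase = "\"" + terms + "\"";
        if (slop > 0) {
            phrase += "~" + slop; // proximity form of the phrase clause
        }
        return phrase + " " + terms;
    }

    public static void main(String[] args) {
        System.out.println(build("natural language", 0));
        System.out.println(build("natural language", 3));
    }
}
```

With slop 0 this yields the query shown above. With a positive slop the phrase clause becomes a proximity search, while the trailing terms still match documents that contain the words on their own.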

Thank you, but Proximity Search won't actually find the words on their own, so that is not what I'm after.

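Sure, which is why I provided another approach, one that combines the individual term queries with a phrase query, and that should serve well. Is that insufficient in some way? – femtoRgon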

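I had actually thought of that, but for larger queries, even with 4 keywords, I would have 3 queries of 2 words, 2 queries of 3 words, and the full query itself. I developed an algorithm to find the sub-queries that takes O(n^3) time, so it does not scale well if you consider a 10-keyword query. I wonder whether it is possible to combine proximity search with the default search?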
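
The counts in this comment can be checked with a short sketch. It assumes that the sub-queries meant here are the contiguous word runs of length two or more; the class and method names are hypothetical. For n keywords there are n(n-1)/2 such runs, and since each run holds up to n words, the total amount of text produced grows roughly cubically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Enumerates every contiguous run of two or more words as a quoted phrase.
final class SubPhrases {

    static List<String> contiguous(String[] words) {
        List<String> phrases = new ArrayList<>();
        // Shortest runs first, so the full query comes last.
        for (int len = 2; len <= words.length; len++) {
            for (int start = 0; start + len <= words.length; start++) {
                String[] run = Arrays.copyOfRange(words, start, start + len);
                phrases.add("\"" + String.join(" ", run) + "\"");
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"natural", "language", "processing", "tools"};
        for (String phrase : contiguous(words)) {
            System.out.println(phrase);
        }
    }
}
```

For four keywords this lists three 2-word phrases, two 3-word phrases and the full phrase, six in total; for ten keywords it lists 45.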