
Finding the position of search matches from Lucene: using Lucene, what is the recommended way to find where the matches occur within the search results?

More concretely, suppose the indexed documents have a field "fullText" which stores the plain-text content of some document. Furthermore, assume that the content of one of these documents is "the quick brown fox jumped over the lazy dog", and a search is then performed for "fox dog". Obviously, that document will come up as a hit.

In that scenario, can Lucene be used to provide something like the matching regions within the found document? So for this case I would like to produce something like:

[{match: "fox", startIndex: 10, length: 3}, 
{match: "dog", startIndex: 34, length: 3}] 

I suspect it could be implemented with what the org.apache.lucene.search.highlight package provides (a rough sketch of what I have in mind follows). I am not sure about the overall approach, though...
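To make the idea concrete, here is a minimal sketch of how I imagine the highlight package could be wired up to collect offsets. It assumes a Lucene 5.x-era API, and the MatchRegionCollector wrapper is purely hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenGroup;

public class MatchRegionCollector {
    // Returns {startIndex, endIndex} pairs for every token that matched the query.
    public static List<int[]> collect(Query query, String field, String fullText) throws Exception {
        final List<int[]> regions = new ArrayList<>();
        // The Formatter is invoked for every token group while the Highlighter
        // scans the text; a positive score means the group matched the query.
        // We record the offsets instead of rewriting the text.
        Formatter collector = (String originalText, TokenGroup group) -> {
            if (group.getTotalScore() > 0) {
                regions.add(new int[] { group.getStartOffset(), group.getEndOffset() });
            }
            return originalText;
        };
        Highlighter highlighter = new Highlighter(collector, new QueryScorer(query));
        // Fragmenting is only a side effect here; the Formatter sees all analyzed tokens.
        highlighter.getBestFragments(
                new StandardAnalyzer().tokenStream(field, fullText), fullText, 1);
        return regions;
    }
}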

Answers


TermFreqVector is what I used. Here is a working demo that prints both the term positions and the start and end offsets of each term:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {
    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();
        s.doSearch(args[0], args[1]);
    }

    Search() {
    }

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used as was used for indexing.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        Directory index = FSDirectory.open(new File(db));

        // 2. Parse the query.
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

        // 3. Search.
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(index, true);
        IndexReader reader = IndexReader.open(index, true);
        searcher.setDefaultFieldSortScoring(true, false);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display term positions and term offsets for each hit. This requires
        //    the "contents" field to have been indexed with term vectors storing
        //    positions and offsets.
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
            TermPositionVector tpvector = (TermPositionVector) tfvector;
            // This part works only if there is one term in the query string;
            // otherwise you will have to iterate this section over the query
            // terms (see the sketch after this listing).
            int termidx = tfvector.indexOf(querystr);
            int[] termposx = tpvector.getTermPositions(termidx);
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);

            for (int j = 0; j < termposx.length; j++) {
                System.out.println("termpos : " + termposx[j]);
            }
            for (int j = 0; j < tvoffsetinfo.length; j++) {
                int offsetStart = tvoffsetinfo[j].getStartOffset();
                int offsetEnd = tvoffsetinfo[j].getEndOffset();
                System.out.println("offsets : " + offsetStart + " " + offsetEnd);
            }

            // Print some info about where the hit was found...
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("path"));
        }

        // The reader and searcher can only be closed when there
        // is no need to access the documents any more.
        reader.close();
        searcher.close();
    }
}
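To lift the single-term limitation noted in step 4, here is a hedged sketch of one way to generalize it, assuming Lucene 3.x's Query.rewrite and extractTerms (additionally import java.util.Set, java.util.HashSet and org.apache.lucene.index.Term). It would replace the single-term lookup inside the hit loop:

// Sketch only: expand the query to concrete terms, then look each one up
// in the document's term vector.
Set<Term> terms = new HashSet<Term>();
q.rewrite(reader).extractTerms(terms);  // rewrite expands wildcard/prefix queries
for (Term term : terms) {
    if (!"contents".equals(term.field())) continue; // only the searched field
    int termidx = tfvector.indexOf(term.text());
    if (termidx < 0) continue;                      // term absent from this doc
    for (TermVectorOffsetInfo info : tpvector.getOffsets(termidx)) {
        System.out.println(term.text() + " offsets : "
                + info.getStartOffset() + " " + info.getEndOffset());
    }
}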

Regarding the comment "this part works only if there is one term in the query string": my next question would be how to find the terms that the query matched, in case it is a complex query (e.g. one using wildcards). This answer fills that gap nicely: http://stackoverflow.com/questions/7896183/get-matched-terms-from-lucene-query – geert3 2013-09-05 09:58:28


Here is a solution for Lucene 5.2.1. It works only for single-word queries, but it should demonstrate the basic principle.

The basic idea is:

  1. Get a TokenStream for each document that matches your query.
  2. Create a QueryScorer and initialize it with the retrieved tokenStream.
  3. Loop over each token of the stream (done by tokenStream.incrementToken()) and check whether the token matches the search criteria (done by queryScorer.getTokenScore()).

Here is the code:

import java.io.IOException;
import java.util.List;
import java.util.Vector;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;

public class OffsetSearcher {

    private IndexReader reader;

    public OffsetSearcher(IndexWriter indexWriter) throws IOException {
        reader = DirectoryReader.open(indexWriter, true);
    }

    public OffsetData[] getTermOffsets(Query query) throws IOException, InvalidTokenOffsetsException {
        List<OffsetData> result = new Vector<>();

        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.search(query, 1000);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        Document doc;
        TokenStream tokenStream;
        CharTermAttribute termAtt;
        OffsetAttribute offsetAtt;
        QueryScorer queryScorer;
        OffsetData offsetData;
        String txt, tokenText;
        for (int i = 0; i < scoreDocs.length; i++) {
            int docId = scoreDocs[i].doc;
            doc = reader.document(docId);

            // Re-tokenize the stored content (or reuse stored term vectors, if present).
            txt = doc.get(RunSearch.CONTENT);
            tokenStream = TokenSources.getTokenStream(RunSearch.CONTENT, reader.getTermVectors(docId), txt, new GermanAnalyzer(), -1);

            termAtt = tokenStream.addAttribute(CharTermAttribute.class);
            offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);

            // Let the QueryScorer wrap the stream so that it can score tokens.
            queryScorer = new QueryScorer(query);
            queryScorer.setMaxDocCharsToAnalyze(RunSearch.MAX_DOC_CHARS);
            TokenStream newStream = queryScorer.init(tokenStream);
            if (newStream != null) {
                tokenStream = newStream;
            }
            queryScorer.startFragment(null);

            tokenStream.reset();

            // Walk the tokens; any token with a positive score matched the query.
            int startOffset, endOffset;
            for (boolean next = tokenStream.incrementToken(); next && (offsetAtt.startOffset() < RunSearch.MAX_DOC_CHARS); next = tokenStream.incrementToken()) {
                startOffset = offsetAtt.startOffset();
                endOffset = offsetAtt.endOffset();

                if ((endOffset > txt.length()) || (startOffset > txt.length())) {
                    throw new InvalidTokenOffsetsException("Token " + termAtt.toString() + " exceeds length of provided text sized " + txt.length());
                }

                float res = queryScorer.getTokenScore();
                if (res > 0.0F && startOffset <= endOffset) {
                    tokenText = txt.substring(startOffset, endOffset);
                    offsetData = new OffsetData(tokenText, startOffset, endOffset, docId);
                    result.add(offsetData);
                }
            }
        }

        return result.toArray(new OffsetData[result.size()]);
    }

    public void close() throws IOException {
        reader.close();
    }

    public static class OffsetData {

        public String phrase;
        public int startOffset;
        public int endOffset;
        public int docId;

        public OffsetData(String phrase, int startOffset, int endOffset, int docId) {
            super();
            this.phrase = phrase;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.docId = docId;
        }

    }

}
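A minimal usage sketch, assuming an IndexWriter named writer over an index whose RunSearch.CONTENT field was indexed with the same GermanAnalyzer (TermQuery and Term come from org.apache.lucene.search and org.apache.lucene.index; the term "fuchs" is just an example):

// Sketch only: a single-word query, as this answer requires.
OffsetSearcher offsetSearcher = new OffsetSearcher(writer);
Query query = new TermQuery(new Term(RunSearch.CONTENT, "fuchs"));
for (OffsetSearcher.OffsetData hit : offsetSearcher.getTermOffsets(query)) {
    System.out.println(hit.phrase + " [" + hit.startOffset + ", " + hit.endOffset
            + ") in doc " + hit.docId);
}
offsetSearcher.close();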

Could you tell us how to implement this for multi-term queries? @matthiasboesinger – Heidar 2016-11-10 15:43:15