Accessing words around a positional match in Lucene

2014-09-12

Given a term match in a document, what is the best way to access the words around that match? I have read this post, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the problem is that the Lucene API has changed completely since it was written (2009). Can anyone point me to how to do this in a newer version of Lucene, such as Lucene 4.6.1?

EDIT

I have now figured this out. (The APIs used in that post (TermEnum, TermDocsEnum, TermPositionsEnum) were removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (wrapping a byte[]) per term within a single field, rather than a Term. Another is that when asking for a Docs/AndPositionsEnum, you now explicitly specify the skipDocs; typically this will be the deleted docs, but in general you can provide any Bits.)
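To make that migration concrete, here is a minimal sketch of my own (not from the original post) of the bare flex enumeration idiom over a whole index; note that in the released 4.x API the Bits argument is the live (non-deleted) docs:

    // Sketch: walk every term of the "content" field and its positions per document.
    Fields fields = MultiFields.getFields(reader);
    Terms terms = fields.terms("content");
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) { // terms arrive as BytesRef, not Term
        DocsAndPositionsEnum dp = termsEnum.docsAndPositions(
            MultiFields.getLiveDocs(reader), null); // Bits of live docs; null means "all"
        if (dp == null) continue; // null if positions were not indexed
        int docId;
        while ((docId = dp.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            for (int i = 0; i < dp.freq(); i++) {
                int position = dp.nextPosition();
                // term.utf8ToString() occurs at `position` in document `docId`
            }
        }
    }

The full example I ended up with for Lucene 4.6: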

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
    public static String[] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs.",
        "The fleece was green and red",
        "History looks fondly upon the story of the golden fleece, but most people don't agree"
    };

    public static void main(String[] args) throws IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46,
            new StandardAnalyzer(Version.LUCENE_46));
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        // Index some made-up content
        IndexWriter writer = new IndexWriter(ramDir, config);
        for (int i = 0; i < DOCS.length; i++) {
            Document doc = new Document();
            Field id = new Field("id", "doc_" + i, Field.Store.YES,
                Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(id);
            // Store both position and offset information in the term vectors
            Field text = new Field("content", DOCS[i], Field.Store.NO,
                Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(text);
            writer.addDocument(doc);
        }
        writer.close();

        // Get a searcher
        DirectoryReader dirReader = DirectoryReader.open(ramDir);
        IndexSearcher searcher = new IndexSearcher(dirReader);
        // Do a search using a SpanQuery
        SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
        TopDocs results = searcher.search(fleeceQ, 10);
        for (int i = 0; i < results.scoreDocs.length; i++) {
            ScoreDoc scoreDoc = results.scoreDocs[i];
            System.out.println("Score Doc: " + scoreDoc);
        }
        IndexReader reader = searcher.getIndexReader();
        // getSpans works per segment; this freshly built RAMDirectory index has a
        // single segment, so leaf 0 covers everything and spans.doc() can be used
        // directly as a top-level doc id.
        Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null,
            new LinkedHashMap<Term, TermContext>());
        int window = 2; // get the words within two positions of the match
        while (spans.next()) {
            int start = spans.start() - window;
            int end = spans.end() + window;
            Map<Integer, String> entries = new TreeMap<Integer, String>();

            System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
            Fields fields = reader.getTermVectors(spans.doc());
            Terms terms = fields.terms("content");

            TermsEnum termsEnum = terms.iterator(null);
            BytesRef text;
            while ((text = termsEnum.next()) != null) {
                // Could store the BytesRef here, but String is easier for this example
                String s = new String(text.bytes, text.offset, text.length);
                DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
                // A term vector is a one-document inverted index, so nextDoc()
                // just positions the enum on that single document.
                if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    int i = 0;
                    int position = -1;
                    while (i < positionsEnum.freq()
                        && (position = positionsEnum.nextPosition()) != -1) {
                        if (position >= start && position <= end) {
                            entries.put(position, s);
                        }
                        i++;
                    }
                }
            }
            System.out.println("Entries:" + entries);
        }
    }
}
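The example stores offsets (Field.TermVector.WITH_POSITIONS_OFFSETS) but never uses them. As a follow-up, a minimal sketch of my own (not part of the original post) that maps a position window back to a raw text snippet via startOffset()/endOffset(); it assumes the "content" field had been stored with Field.Store.YES, which the example above deliberately does not do:

    // Sketch only: turn a [start, end] position window into a character snippet
    // using the offsets recorded in the term vectors. Assumes "content" is stored.
    static String snippetAround(IndexReader reader, int docId, int start, int end)
            throws IOException {
        String raw = reader.document(docId).get("content");
        TermsEnum termsEnum = reader.getTermVectors(docId).terms("content").iterator(null);
        int minOff = Integer.MAX_VALUE;
        int maxOff = -1;
        while (termsEnum.next() != null) {
            DocsAndPositionsEnum dp = termsEnum.docsAndPositions(null, null);
            if (dp == null || dp.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
                continue;
            }
            for (int i = 0; i < dp.freq(); i++) {
                int position = dp.nextPosition();
                if (position >= start && position <= end) {
                    minOff = Math.min(minOff, dp.startOffset()); // valid right after nextPosition()
                    maxOff = Math.max(maxOff, dp.endOffset());
                }
            }
        }
        return maxOff < 0 ? "" : raw.substring(minOff, maxOff);
    }

Inside the spans loop above, snippetAround(reader, spans.doc(), start, end) would then print the surrounding text itself rather than just the position-to-term map.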

Answer

Use a Highlighter. Highlighter.getBestFragment can be used to retrieve the fragment of the text that contains the best match. For example:

TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);

Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));
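For reference, a fuller sketch of the same approach (my elaboration; the query, formatter, and field name are illustrative assumptions, not from the answer):

    // Sketch of a complete Highlighter round trip (lucene-highlighter module).
    // getBestFragment throws IOException and InvalidTokenOffsetsException.
    Query query = new TermQuery(new Term("content", "fleece"));
    TopDocs docs = searcher.search(query, 10);

    QueryScorer scorer = new QueryScorer(query, "content");
    Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);

    for (ScoreDoc sd : docs.scoreDocs) {
        // The highlighter needs the raw text, so the field must be stored
        String text = searcher.doc(sd.doc).get("content");
        String fragment = highlighter.getBestFragment(
            new StandardAnalyzer(Version.LUCENE_46), "content", text);
        System.out.println(fragment);
    }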
Thanks, but I don't think I need a Highlighter class to do this. – 2014-09-17 13:47:43

Certainly, you don't. If you prefer, you can run your own linear search through the returned documents instead. But why not use the tool that was designed for exactly this purpose? – femtoRgon 2014-09-17 14:45:03

Yes, you're right. I tried your solution, and even when the search text is stemmed I can still get the matched words with it. Thanks! – 2014-09-18 20:42:42
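To illustrate the stemming point (my illustration, not from the thread): because the Highlighter re-analyzes the raw text with the same analyzer, a query on a stemmed term still highlights the original surface word:

    // With a stemming analyzer, "dogs" in the raw text analyzes to "dog", which
    // matches the query term, and the offsets still point at the surface word "dogs".
    Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_46);
    Query query = new TermQuery(new Term("content", "dog"));
    Highlighter highlighter = new Highlighter(new QueryScorer(query, "content"));
    String fragment = highlighter.getBestFragment(analyzer, "content",
        "The quick red fox jumped over the lazy brown dogs.");
    System.out.println(fragment); // "dogs" comes back wrapped in the default <B>...</B> tags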