2014-05-08

Lucene custom TokenStream

I am using Lucene to count words (see the example below).

My question is: how can I set up my own filters in Lucene? For example, adding my custom StopFilter, ShingleFilter, and so on.

I assume some token stream filters are already in use, because Hello, hello, hello, and HELLO are all converted to "hello".

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class CountWordsExample {
    public static void main(String[] args) throws IOException {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        Document document = new Document();
        document.add(new TextField("foo", "Hello hello how are you", Store.YES));
        document.add(new TextField("foo", "hello how are you", Store.YES));
        document.add(new TextField("foo", "HELLO", Store.YES));
        writer.addDocument(document);
        writer.commit();
        writer.close(true);

        // ShingleFilter shingle = new ShingleFilter(input);

        IndexReader indexReader = DirectoryReader.open(directory);

        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        Fields fields = MultiFields.getFields(indexReader);

        // Per-document term frequencies.
        for (String field : fields) {
            TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                    .iterator(null);
            BytesRef bytesRef;
            while ((bytesRef = termEnum.next()) != null) {
                if (termEnum.seekExact(bytesRef)) {
                    DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                    if (docsEnum != null) {
                        int doc;
                        while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            System.out.println(bytesRef.utf8ToString()
                                    + " in doc " + doc + ": " + docsEnum.freq());
                        }
                    }
                }
            }
        }

        // Document frequency per term.
        for (String field : fields) {
            TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                    .iterator(null);
            BytesRef bytesRef;
            while ((bytesRef = termEnum.next()) != null) {
                int freq = indexReader.docFreq(new Term(field, bytesRef));
                System.out.println(bytesRef.utf8ToString() + " in " + freq
                        + " documents");
            }
        }
    }
}

Output:

hello in doc 0: 4 
how in doc 0: 2 
you in doc 0: 2 
hello in 1 documents 
how in 1 documents 
you in 1 documents 
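For reference, the lowercasing seen in this output comes from StandardAnalyzer's built-in filter chain, which also explains why "are" never appears (it is in the default English stop word set). A sketch of roughly what StandardAnalyzer wires up per field in the 4.x API (the class name here is illustrative):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Roughly equivalent to StandardAnalyzer in Lucene 4.x:
// tokenize, normalize, lowercase, then drop English stop words.
public class StandardLikeAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream sink = new StandardFilter(Version.LUCENE_47, source);
        sink = new LowerCaseFilter(Version.LUCENE_47, sink);
        sink = new StopFilter(Version.LUCENE_47, sink, StandardAnalyzer.STOP_WORDS_SET);
        return new TokenStreamComponents(source, sink);
    }
}
```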

Answers


So the answer is quite simple: the way to define my own token processing is to define my own Analyzer. For example:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class NGramAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        TokenStream f = new StandardTokenizer(Version.LUCENE_47, reader);
        f = new LowerCaseFilter(Version.LUCENE_47, f);

        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        return new TokenStreamComponents(source, f);
    }
}
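To plug the custom analyzer in, it can be passed where the question's code passes StandardAnalyzer. A minimal sketch against the 4.7 API (the class name UseCustomAnalyzer is illustrative):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UseCustomAnalyzer {
    public static void main(String[] args) throws IOException {
        RAMDirectory directory = new RAMDirectory();
        // Same setup as the question, but with the custom analyzer
        // instead of StandardAnalyzer:
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_47, new NGramAnalyzer()));
        // ... add and commit documents as before ...
        writer.close(true);
    }
}
```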

Looks like old code to me. That won't work in 4.0 onward (which you are using). You need to override `createComponents`; see the [`Analyzer` documentation](http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/analysis/Analyzer.html) for an example. – femtoRgon


Sorry, fixed. It should now meet the 4.7 requirements. –


OK, so why are you using two different tokenizers? I've never seen that done. Normally you would keep a handle on the tokenizer actually used in the filter chain (the `StandardTokenizer`, in this case) and pass that into your TokenStreamComponents. Is there something this is intended to accomplish? – femtoRgon
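Following that comment, a chain built on a single tokenizer, which also adds the StopFilter and ShingleFilter the question asked about, could look like this. This is a sketch against the 4.7 API; the class name, the stop word set, and the bigram shingle size are illustrative choices, not the only options:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class CustomChainAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // One tokenizer; every filter wraps the stream derived from it.
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream sink = new LowerCaseFilter(Version.LUCENE_47, source);
        sink = new StopFilter(Version.LUCENE_47, sink, StandardAnalyzer.STOP_WORDS_SET);
        sink = new ShingleFilter(sink, 2, 2);  // emit bigram shingles
        // The same tokenizer instance goes in as the source.
        return new TokenStreamComponents(source, sink);
    }
}
```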