2014-04-10

Lucene custom analyzer for indexing and querying

I am using Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration:

<analyzer> 
    <charFilter class="solr.HTMLStripCharFilterFactory"/> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> 
    <filter class="solr.WordDelimiterFilterFactory" 
      generateWordParts="1" 
      generateNumberParts="1" 
      catenateWords="1" 
      catenateNumbers="1" 
      catenateAll="0" 
      splitOnCaseChange="0" 
      splitOnNumerics="0" 
      preserveOriginal="1" 
    /> 
    <filter class="solr.LowerCaseFilterFactory"/> 
    <filter class="solr.PorterStemFilterFactory"/> 
    </analyzer> 

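However, I can't figure out how to use HTMLStripCharFilterFactory and WordDelimiterFilterFactory with the configuration above. Also, my query-time analyzer in Solr is as follows; how can I achieve the same thing in Lucene?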

<analyzer type="query"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.StopFilterFactory" 
      ignoreCase="true" 
      words="stopwords.txt" 
      /> 
    <filter class="solr.LowerCaseFilterFactory"/> 
    <filter class="solr.PorterStemFilterFactory"/> 
    </analyzer> 

Answer


The Analysis package documentation explains how to use a CharFilter: you wrap the incoming Reader with it in an overridden initReader method.

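I assume your problem with WordDelimiterFilter is that you don't know how to set the configuration options you are using? You build an int to pass to the constructor by combining the appropriate constants with a bitwise OR (|). Each constant is a distinct bit, so combining them with AND (&) would always give 0. In the end you might end up with something like this:

//StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends StopwordAnalyzerBase {

    // Static, so that it can be referenced inside the super(...) call.
    private static final Version VERSION = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
        //Might as well load this config up front, along with the stop words
        wordDelimiterConfig =
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.GENERATE_NUMBER_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.CATENATE_NUMBERS |
            WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
        // null: no protected words are excluded from delimiting
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        filter = new LowerCaseFilter(VERSION, filter);
        filter = new StopFilter(VERSION, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Strip HTML markup before the tokenizer sees the text
        return new HTMLStripCharFilter(reader);
    }
}

Note: on its own, the flag combination looks like this:

int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.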
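
To check how the flag constants combine without pulling in Lucene, here is a small standalone sketch. The class name `WordDelimiterFlagsDemo` and its helper methods are made up for illustration, and the constants are local stand-ins assumed to mirror the bit values of the corresponding `WordDelimiterFilter` fields. It suggests that OR keeps every requested bit, while AND of distinct single-bit values leaves nothing set.

```java
// Standalone sketch: these constants are local stand-ins assumed to mirror the
// bit values of the WordDelimiterFilter flags; this is not the Lucene class.
final class WordDelimiterFlagsDemo {
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS = 4;
    static final int CATENATE_NUMBERS = 8;
    static final int PRESERVE_ORIGINAL = 32;

    // Bitwise OR sets every requested bit in the result.
    static int combineWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_WORDS
                | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND of distinct single-bit values leaves no bit set.
    static int combineWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS & CATENATE_WORDS
                & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // A flag counts as enabled when its bit survives masking.
    static boolean isSet(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(combineWithOr());   // prints 47
        System.out.println(combineWithAnd());  // prints 0
    }
}
```

A config of 0 would switch off every option, including preserveOriginal, so the choice of operator matters for the filter's output.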

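Note that I have moved the LowerCaseFilter ahead of the StopFilter. This makes stop word matching case-insensitive, as long as your stop word definitions are all lowercase. I don't know whether that causes a problem with the WordDelimiterFilter. If it does, there is a loadStopwordSet method that supports case insensitivity, but frankly I don't know how to use it.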
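
The effect of the filter order can be illustrated with a toy model in plain Java. This is only a sketch: `FilterOrderDemo` and its methods are hypothetical helpers operating on lists of strings, not Lucene token streams, and the stop set is assumed to be case-sensitive and all lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of two analysis steps on plain string lists (not Lucene classes).
final class FilterOrderDemo {

    // Drop every token that appears verbatim in the case-sensitive stop set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> lowered = new ArrayList<>();
        for (String token : tokens) {
            lowered.add(token.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    // Order used in the answer: lowercase first, then remove stop words.
    static List<String> lowerThenStop(List<String> tokens, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokens), stopWords);
    }

    // Reverse order: capitalised stop words slip past the stop set.
    static List<String> stopThenLower(List<String> tokens, Set<String> stopWords) {
        return lowerCase(removeStopWords(tokens, stopWords));
    }
}
```

With the tokens "The", "Quick", "fox" and a stop set containing only "the", `lowerThenStop` drops "The", whereas `stopThenLower` keeps it and emits "the".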