2013-10-16 15 views

I am currently developing a custom analyzer for a Mahout clustering project. Since Mahout 0.8 updated Lucene to 4.3, I can no longer generate tokenized document files or a SequenceFile from the book's outdated examples. The code below is my revision of the sample code from the book Mahout in Action; however, it throws an IllegalStateException.

Mahout in Action Analyzer trouble with Lucene 4.3

public class MyAnalyzer extends Analyzer { 

private final Pattern alphabets = Pattern.compile("[a-z]+"); 
Version version = Version.LUCENE_43; 

@Override 
protected TokenStreamComponents createComponents(String fieldName, Reader reader) { 
    Tokenizer source = new StandardTokenizer(version, reader); 
    TokenStream filter = new StandardFilter(version, source); 

    filter = new LowerCaseFilter(version, filter); 
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET); 

    CharTermAttribute termAtt = (CharTermAttribute)filter.addAttribute(CharTermAttribute.class); 
    StringBuilder buf = new StringBuilder(); 

    try { 

     filter.reset(); 
     while(filter.incrementToken()){ 
      if(termAtt.length()>10){ 
       continue; 
      } 
      String word = new String(termAtt.buffer(), 0, termAtt.length()); 
      Matcher matcher = alphabets.matcher(word); 
      if(matcher.matches()){ 
       buf.append(word).append(" "); 
      } 
     } 
    } catch (IOException e) { 
     e.printStackTrace(); 
    } 
    source = new WhitespaceTokenizer(version, new StringReader(buf.toString())); 

    return new TokenStreamComponents(source, filter); 

} 

}

Answer


I'm not sure why you're getting an IllegalStateException, but there are a couple of likely causes. Normally an analyzer builds its filters on top of its tokenizer. You do that, but then you create a second tokenizer and pass that one back, so the filter you return has no relation to the tokenizer you return. Also, the filter you built has already been consumed to its end by the time it is returned, which is worth looking into as well.

The main problem, though, is that createComponents isn't really the place to implement analysis logic. It is where you set up the Tokenizer and the chain of filters that do the work. It would make far more sense to implement your custom filtering logic in a filter of its own, by extending TokenStream (or AttributeSource, etc.).
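As a minimal sketch of that idea (assuming Lucene 4.3's TokenFilter API; the class name AlphaOnlyFilter and the length cutoff of 10 are made up here for illustration), a custom filter that keeps only short, purely alphabetic tokens might look like:

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical example: drops every token that is longer than 10 characters
// or contains a non-alphabetic character, instead of buffering tokens
// inside createComponents().
public final class AlphaOnlyFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Pattern alphabets = Pattern.compile("[a-z]+");

    public AlphaOnlyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull tokens from the upstream stream until one passes the
        // predicate, or the stream is exhausted.
        while (input.incrementToken()) {
            String term = termAtt.toString();
            if (term.length() <= 10 && alphabets.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

It would then be chained like any other filter inside createComponents, e.g. `filter = new AlphaOnlyFilter(filter);`.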

I think what you're looking for is already implemented, though. PatternReplaceCharFilter works at the character level (it wraps the Reader before tokenization, so it cannot be chained onto a TokenStream as in your code); its token-level counterpart, PatternReplaceFilter, slots straight into the filter chain:

private final Pattern nonAlpha = Pattern.compile(".*[^a-z].*");

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(version, reader);
    TokenStream filter = new StandardFilter(version, source);
    filter = new LowerCaseFilter(version, filter);
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);
    // Replace any token containing a non-alphabetic character with the
    // empty string.
    filter = new PatternReplaceFilter(filter, nonAlpha, "", true);
    return new TokenStreamComponents(source, filter);
}

Or perhaps something even simpler, like this, would help:

@Override 
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // LowerCaseTokenizer splits on non-letters and lowercases in one step.
    Tokenizer source = new LowerCaseTokenizer(version, reader);
    TokenStream filter = new StopFilter(version, source, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
}

If I want to implement a filter that isn't part of the Lucene library, and use CharTermAttribute the way the book's author does, how do I customize that inside the Analyzer? – Jason