Mahout in Action Analyzer trouble with Lucene 4.3

I am currently developing a custom analyzer for a Mahout clustering project. Because Mahout 0.8 updated Lucene to 4.3, I can no longer generate the tokenized document files (SequenceFiles) from the book's outdated examples. The code below is my revision of the sample code from Mahout in Action, but it gives me an IllegalStateException.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

    private final Pattern alphabets = Pattern.compile("[a-z]+");
    Version version = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream filter = new StandardFilter(version, source);
        filter = new LowerCaseFilter(version, filter);
        filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute termAtt = (CharTermAttribute) filter.addAttribute(CharTermAttribute.class);
        StringBuilder buf = new StringBuilder();
        try {
            filter.reset();
            while (filter.incrementToken()) {
                if (termAtt.length() > 10) {
                    continue;
                }
                String word = new String(termAtt.buffer(), 0, termAtt.length());
                Matcher matcher = alphabets.matcher(word);
                if (matcher.matches()) {
                    buf.append(word).append(" ");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        source = new WhitespaceTokenizer(version, new StringReader(buf.toString()));
        return new TokenStreamComponents(source, filter);
    }
}
If I want to implement filters that are not part of the Lucene library, and use CharTermAttribute the way the book's author did, how can I customize them inside the Analyzer? – Jason
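For reference, the accept/reject rule the loop above applies (keep only terms that are purely alphabetic and at most 10 characters long) can be factored into a plain predicate; in a custom Lucene 4.3 TokenFilter subclass the same check would run inside incrementToken() against the CharTermAttribute. The sketch below is Lucene-free and all names in it (TokenRulesSketch, accept, filter) are hypothetical, chosen only to mirror the question's logic:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the loop in the question: keep only
// terms that are purely alphabetic (lowercase a-z) and at most 10
// characters long. In a real Lucene TokenFilter, accept() would be
// called from incrementToken() on the current CharTermAttribute value.
class TokenRulesSketch {
    private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");

    // True if the term would survive the custom filter.
    static boolean accept(String term) {
        return term.length() <= 10 && ALPHABETS.matcher(term).matches();
    }

    // Applies the rule to a whole token list, preserving order.
    static List<String> filter(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            if (accept(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("hello", "WORLD", "ok")));
    }
}
```

The point of factoring the rule out is that the filtering then no longer has to happen inside createComponents(), which is only supposed to assemble and return the tokenizer/filter chain, not consume it.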