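Lucene custom TokenStream
I am using Lucene to count words (see the example below).
My question is: how do I set up my own filters in Lucene? For example, how do I add my own custom StopFilter, ShingleFilter, and so on?
I assume some token stream filters are already being applied, since Hello, hello and HELLO are all converted to "hello".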
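    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    public class CountWordsExample {
        public static void main(String[] args) throws IOException {
            RAMDirectory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));

            // A single document with three values for the same field "foo".
            Document document = new Document();
            document.add(new TextField("foo", "Hello hello how are you", Store.YES));
            document.add(new TextField("foo", "hello how are you", Store.YES));
            document.add(new TextField("foo", "HELLO", Store.YES));
            writer.addDocument(document);
            writer.commit();
            writer.close(true);
            // ShingleFilter shingle = new ShingleFilter(input);

            IndexReader indexReader = DirectoryReader.open(directory);
            Bits liveDocs = MultiFields.getLiveDocs(indexReader);
            Fields fields = MultiFields.getFields(indexReader);

            // First pass: frequency of each term within each document.
            for (String field : fields) {
                TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                        .iterator(null);
                BytesRef bytesRef;
                while ((bytesRef = termEnum.next()) != null) {
                    if (termEnum.seekExact(bytesRef)) {
                        DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                        if (docsEnum != null) {
                            int doc;
                            while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                                System.out.println(bytesRef.utf8ToString()
                                        + " in doc " + doc + ": "
                                        + docsEnum.freq());
                            }
                        }
                    }
                }
            }

            // Second pass: number of documents that contain each term.
            for (String field : fields) {
                TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                        .iterator(null);
                BytesRef bytesRef;
                while ((bytesRef = termEnum.next()) != null) {
                    int freq = indexReader.docFreq(new Term(field, bytesRef));
                    System.out.println(bytesRef.utf8ToString() + " in " + freq
                            + " documents");
                }
            }
            indexReader.close();
        }
    }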
Output:
    hello in doc 0: 4
    how in doc 0: 2
    you in doc 0: 2
    hello in 1 documents
    how in 1 documents
    you in 1 documents
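For what it's worth, the counts above are consistent with a chain that lowercases every token and then drops English stop words such as "are", which is what `StandardAnalyzer` does in 4.x. The following is only a plain-Java model of those two stages, not Lucene code: the class name `AnalysisChainSketch`, the whitespace tokenizer and the shortened stop word list are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AnalysisChainSketch {

    // Illustrative subset of English stop words (an assumption, not Lucene's full set).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "are", "is", "the"));

    // Counts terms across all field values after lowercasing and stop word removal.
    static Map<String, Integer> countTerms(String... values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            for (String token : value.split("\\s+")) {          // crude whitespace tokenizer
                if (token.isEmpty()) {
                    continue;
                }
                String term = token.toLowerCase(Locale.ROOT);   // lowercase stage
                if (STOP_WORDS.contains(term)) {                // stop word stage
                    continue;
                }
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms(
                "Hello hello how are you", "hello how are you", "HELLO");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Run on the three field values from the question, this model gives hello 4, how 2 and you 2, with no entry for "are", matching the per-document frequencies printed by the index.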
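This looks like old code to me. It won't work in 4.0 or later (which is what you are using). You need to override `createComponents`; see the [`Analyzer` documentation](http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/analysis/Analyzer.html) for an example. – femtoRgon
Sorry, fixed. It should now meet the requirements of 4.7. –
OK, so why are you using two different tokenizers? I have never seen that done. Normally you would keep a reference to the tokenizer used in the filter chain (`StandardTokenizer` in this case) and pass it into your TokenStreamComponents. Is there something this is meant to accomplish? – femtoRgon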
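A minimal sketch of what the comments above describe, assuming Lucene 4.7 (lucene-core and lucene-analyzers-common) is on the classpath: a single `StandardTokenizer` is kept as the source, the filters are stacked on top of it, and both are handed to `TokenStreamComponents`. The names `CustomAnalyzerSketch`, `ShingleAnalyzer` and `analyze` are hypothetical, and the particular filters and shingle size (2) are illustrative choices, not something the question specifies.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CustomAnalyzerSketch {

    // Hypothetical analyzer: one tokenizer feeds the whole filter chain.
    static class ShingleAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
            TokenStream chain = new LowerCaseFilter(Version.LUCENE_47, source);
            chain = new StopFilter(Version.LUCENE_47, chain,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            chain = new ShingleFilter(chain, 2, 2); // also emits unigrams by default
            // Pass the same tokenizer that sits at the bottom of the chain.
            return new TokenStreamComponents(source, chain);
        }
    }

    // Helper (hypothetical) that collects the terms an analyzer emits for a text.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("foo", new StringReader(text));
        try {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(analyze(new ShingleAnalyzer(), "Hello hello how are you"));
    }
}
```

To use such an analyzer for indexing, an instance would be passed to `IndexWriterConfig` in place of `new StandardAnalyzer(Version.LUCENE_47)`. Because removed stop words leave position gaps, `ShingleFilter` may emit filler tokens in some shingles, so it is worth inspecting the output of `analyze` before relying on the counts.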