Tokenize, remove stop words using Lucene with Java
I am trying to tokenize a txt file and remove stop words from it using Lucene. I have this:
public class Stopwords {
    public String removeStopWords(String string) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("I");
        stopWords.add("the");
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
            System.out.println(sb);
        }
        return sb.toString();
    }
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This gives me an error, but I can't figure out why.
What error are you seeing? – femtoRgon
It complains about while (tokenStream.incrementToken()) – whyname
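A likely cause, assuming Lucene 4.x: the TokenStream consumer workflow requires reset() to be called before the first incrementToken(), followed by end() and close() once the stream is exhausted. The method above never calls reset(), which would explain a failure on the while line. The sketch below is one way the method could be rewritten under that assumption; it is not a confirmed fix, since the actual exception was not posted. It also assumes the Lucene 4.3 StopFilter constructor, which takes a CharArraySet rather than a java.util.Set, and it uses addAttribute instead of getAttribute so the attribute is registered if it is not already present.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class Stopwords {

    public String removeStopWords(String text) throws IOException {
        // Assumption: Lucene 4.3, where StopFilter expects a CharArraySet.
        // ignoreCase = false keeps matching case-sensitive, as with the
        // original HashSet: "I" is removed, "i" is not.
        CharArraySet stopWords = new CharArraySet(
                Version.LUCENE_43, Arrays.asList("a", "an", "I", "the"), false);

        TokenStream tokenStream =
                new StandardTokenizer(Version.LUCENE_43, new StringReader(text));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        try {
            // Must be called before the first incrementToken().
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(token.toString());
            }
            // Signals end of stream so filters can finish their bookkeeping.
            tokenStream.end();
        } finally {
            // Releases the underlying reader even if tokenizing fails.
            tokenStream.close();
        }
        return sb.toString();
    }
}
```

With this version, main only needs to call removeStopWords(text) once and print the returned string; the per-token println inside the loop is dropped here because it prints the growing buffer on every iteration.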