Tokenize and remove stop words using Lucene with Java

I am trying to tokenize a txt file and remove stop words from it using Lucene. I have this:

public String removeStopWords(String string) throws IOException {

    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}

My main looks like this:

String file = "..../datatest.txt";

TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);

String text = fr.content;

Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));

This gives me an error, but I can't figure out why.


2013-07-12 whyname

What error are you seeing? – femtoRgon

It complains about while (tokenStream.incrementToken()) – whyname

You can try calling tokenStream.reset() before calling tokenStream.incrementToken().


2014-03-02 07:11:06 user3370153

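I had the same problem. To remove stop words using Lucene, you can either use its default stop set, available through the method EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.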

The code below shows a corrected version of your removeStopWords():

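// Imports, using the Lucene 4.8 package locations
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public static String removeStopWords(String textFile) throws Exception {
    // Lucene's built-in English stop-word set
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));

    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // must be called before the first incrementToken()
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    tokenStream.end();   // finish the stream, then release its resources
    tokenStream.close();
    return sb.toString();
}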

To use a custom list of stop words:

//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set 
final List<String> stop_Words = Arrays.asList("fox", "the"); 
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
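
To make the effect of the `true` (ignoreCase) argument concrete, here is a minimal plain-Java sketch with no Lucene dependency. It is not Lucene's implementation: the class name PlainStopWordSketch is hypothetical, and it splits on whitespace only, whereas StandardTokenizer also handles punctuation. It only illustrates case-insensitive matching against a custom stop list.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class PlainStopWordSketch {

    // Drops every whitespace-separated token that matches a stop word, ignoring case.
    static String removeStopWords(String text, Collection<String> stopWords) {
        // Lower-case the stop list once so lookups are case-insensitive.
        Set<String> lowered = new HashSet<>();
        for (String word : stopWords) {
            lowered.add(word.toLowerCase(Locale.ROOT));
        }

        StringBuilder sb = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || lowered.contains(token.toLowerCase(Locale.ROOT))) {
                continue; // skip empty input and stop words
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "quick brown jumps"
        System.out.println(removeStopWords("The quick brown fox jumps", Arrays.asList("fox", "the")));
    }
}
```

Unlike the loop in removeStopWords() above, this version puts the separator before each token after the first, so the result has no trailing space.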


2014-05-16 15:54:15 user692704

What imports are needed to make the above code work? –
