2013-07-12 23 views
4

Tokenize and remove stop words using Lucene with Java

I am trying to tokenize a txt file and remove stop words from it using Lucene. I have this:

public String removeStopWords(String string) throws IOException {

    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
}

My main method looks like this:

String file = "..../datatest.txt";

TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);

String text = fr.content;

Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));

This gives me an error, but I can't figure out why.

What error are you seeing? – femtoRgon

It complains about while (tokenStream.incrementToken()) – whyname

Answers

0

You could try calling tokenStream.reset() before calling tokenStream.incrementToken().
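To illustrate why the order of calls matters: Lucene's TokenStream documentation describes a consumer workflow of reset(), then incrementToken() until it returns false, then end() and close(). The sketch below is a toy model in plain Java, not Lucene's implementation. ToyTokenStream, term() and consume() are hypothetical names made up for this example, and the exception that real Lucene raises when reset() is skipped depends on the version.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for a token stream; NOT Lucene's actual implementation.
public class ResetContractDemo {

    /** Hypothetical stream that refuses to produce tokens until reset() is called. */
    static class ToyTokenStream {
        private final List<String> tokens;
        private Iterator<String> cursor; // stays null until reset() is called
        private String current;

        ToyTokenStream(String... tokens) {
            this.tokens = Arrays.asList(tokens);
        }

        /** Positions the stream at its first token. */
        void reset() {
            cursor = tokens.iterator();
        }

        /** Advances to the next token; returns false when the stream is exhausted. */
        boolean incrementToken() {
            if (cursor == null) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            if (!cursor.hasNext()) {
                return false;
            }
            current = cursor.next();
            return true;
        }

        /** Returns the token that the last successful incrementToken() moved to. */
        String term() {
            return current;
        }
    }

    /** Consumes the stream in the expected order and joins the terms with single spaces. */
    static String consume(ToyTokenStream stream) {
        StringBuilder sb = new StringBuilder();
        stream.reset();
        while (stream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(stream.term());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyTokenStream("quick", "brown", "fox")));
        try {
            new ToyTokenStream("quick").incrementToken(); // reset() deliberately skipped
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

The question's loop corresponds to the second case in main: incrementToken() is reached without a prior reset().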

8

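I had the same problem. To remove stop words with Lucene, you can use its default stop set, which is returned by EnglishAnalyzer.getDefaultStopSet(). Otherwise, you can create your own custom stop-word list.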

The code below shows a corrected version of your removeStopWords():

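import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public static String removeStopWords(String textFile) throws Exception {
    // Lucene's built-in English stop-word set
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // must be called before the first incrementToken()
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    tokenStream.end();   // finish the stream once all tokens are consumed
    tokenStream.close(); // release the underlying reader
    return sb.toString();
}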

To use a custom list of stop words, use the following:

// CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // this is Lucene's default set
final List<String> stop_Words = Arrays.asList("fox", "the");
// the third argument (true) makes matching case-insensitive; pass stopSet to StopFilter
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
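
As a rough illustration of what the third constructor argument (ignoreCase = true) changes, here is a minimal plain-Java sketch of case-insensitive stop-word removal. It is not Lucene code: it splits on whitespace instead of using StandardTokenizer, and the class name IgnoreCaseStopWordsDemo and its removeStopWords helper are hypothetical names for this example only. Since StandardTokenizer does not lowercase tokens on its own, a case-sensitive set containing "the" would let "The" through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch of case-insensitive stop-word filtering; NOT Lucene's implementation.
public class IgnoreCaseStopWordsDemo {

    /** Drops every whitespace-separated token that matches a stop word, ignoring case. */
    static String removeStopWords(String text, List<String> stopWords) {
        // Lower-case the stop words once so that lookups can ignore case.
        Set<String> lowered = new HashSet<>();
        for (String word : stopWords) {
            lowered.add(word.toLowerCase(Locale.ROOT));
        }

        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !lowered.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token); // keep the token in its original casing
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("fox", "the");
        // prints "quick brown jumps over lazy dog"
        System.out.println(removeStopWords("The quick brown Fox jumps over the lazy dog", stopWords));
    }
}
```

Both "The" and "Fox" are removed even though the list only contains the lower-case forms, which is the behavior the ignoreCase flag asks for.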

What imports are needed to make the above code work? –