Using the Lucene spell checker

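I am trying to write a spelling corrector using the Lucene spell checker. I would like to give it a single text file containing the text of blog posts. The problem is that it only works when my dictionary file has one sentence/word per line. Also, the suggestion API returns results without giving any weight to how often a word occurs. Here is the source code: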

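import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SpellCorrector { 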

     SpellChecker spellChecker = null; 

     public SpellCorrector() { 
       try { 
         File file = new File("/home/ubuntu/spellCheckIndex"); 
         Directory directory = FSDirectory.open(file); 

         spellChecker = new SpellChecker(directory); 

         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36); 
         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer); 
         spellChecker.indexDictionary(
             new PlainTextDictionary(new File("/home/ubuntu/main.dictionary")), config, true); 
                     //Should I format this file with one sentence/word per line? 

       } catch (IOException e) { 
         // Report the failure instead of silently ignoring it. 
         e.printStackTrace(); 
       } 

     } 

     public String correct(String query) { 
       if (spellChecker != null) { 
         try { 
           String[] suggestions = spellChecker.suggestSimilar(query, 5); 
           // This returns suggestions ordered not by number of occurrences but by when each word occurred 

           if (suggestions != null) { 
             if (suggestions.length != 0) { 
               return suggestions[0]; 
             } 
           } 
         } catch (IOException e) { 
           return null; 
         } 
       } 
       return null; 
     } 
}

Do I need to make some changes?


2013-03-15 Global Warrior

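As for your first question, that sounds like the expected, documented dictionary format; see the PlainTextDictionary API. If you want to pass in arbitrary text, you may want to index it and use a LuceneDictionary, or possibly a HighFrequencyDictionary, depending on your needs.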
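If indexing the text is more than you need, a different route from the LuceneDictionary one (and not something the Lucene docs prescribe) is to pre-process the blog text into the one-word-per-line layout that PlainTextDictionary reads. Below is a minimal sketch using only the JDK. The class name `DictionaryLines`, the method `fromText`, and the rule that a word is a run of letters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class DictionaryLines {

    // Splits free text into distinct lowercase words, in order of first appearance.
    // Assumption: a "word" is a run of Unicode letters; adjust the pattern for your data.
    static List<String> fromText(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // Each printed line would become one line of the dictionary file.
        for (String word : fromText("The blog post. The spell checker!")) {
            System.out.println(word);
        }
    }
}
```

Note that a word list like this carries no frequency information, so it does not help with the popularity question by itself.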

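The SpellChecker suggests replacements based on the similarity between words (Levenshtein distance) before any other consideration. If you want it to suggest only more popular terms, you should pass a SuggestMode into SpellChecker.suggestSimilar. This ensures that the suggested matches are at least as strong, popularity-wise, as the word they are intended to replace.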
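To make the similarity measure concrete, here is a minimal sketch of plain Levenshtein distance using only the JDK. It is not Lucene's code (Lucene ships its own implementation and, as far as I know, normalizes the result into a score); `EditDistance` and `levenshtein` are names invented for this example.

```java
// Hypothetical stand-alone helper, not part of Lucene.
class EditDistance {

    // Classic dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions, or substitutions that turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

Because this measure looks only at characters, a rare word and a common word at the same distance from the query are equally good candidates, which is why popularity has to be brought in separately.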

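If you must override how Lucene decides the best matches, you can do so with SpellChecker.setComparator, creating your own Comparator on SuggestWords. Since SuggestWord exposes freq to you, it should be easy to arrange the found matches by popularity.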
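The ordering idea can be sketched without Lucene on the classpath. In the example below, `Suggestion` is a hypothetical stand-in that mirrors the `string`, `score` and `freq` fields of SuggestWord, and `PopularityFirst` is an invented name. It sorts a plain list so that the most frequent word comes first, with ties broken by similarity score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Lucene's SuggestWord (fields: string, score, freq).
class Suggestion {
    final String string; // the suggested word
    final float score;   // similarity to the misspelled word
    final int freq;      // how often the word occurs

    Suggestion(String string, float score, int freq) {
        this.string = string;
        this.score = score;
        this.freq = freq;
    }
}

// Sort order for a list: higher freq first, ties broken by higher score.
class PopularityFirst implements Comparator<Suggestion> {
    @Override
    public int compare(Suggestion a, Suggestion b) {
        int byFreq = Integer.compare(b.freq, a.freq);
        return byFreq != 0 ? byFreq : Float.compare(b.score, a.score);
    }

    public static void main(String[] args) {
        List<Suggestion> found = new ArrayList<>(Arrays.asList(
                new Suggestion("spell", 0.8f, 3),
                new Suggestion("spelt", 0.9f, 3),
                new Suggestion("smell", 0.7f, 10)));
        Collections.sort(found, new PopularityFirst());
        for (Suggestion s : found) {
            System.out.println(s.string + " freq=" + s.freq + " score=" + s.score);
        }
    }
}
```

Two caveats before porting this to a `Comparator<SuggestWord>` for setComparator. First, Lucene's own comparators appear to treat the "greater" element as the better suggestion, so the sign of the comparison may need to be flipped; check it against your version. Second, as far as I can tell from the 3.6 source, freq is only filled in when suggestSimilar is called with an IndexReader and a field name; with the two-argument call used in the question it stays 0. If I recall correctly, Lucene also ships a ready-made SuggestWordFrequencyComparator that may already do what you want.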


2013-03-15 15:36:29 femtoRgon
