2012-12-18 187 views
15

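Spell checker for Python

I am fairly new to Python and NLTK. I am working on an application that performs spell checking (it replaces a misspelled word with the correctly spelled one). I am currently using the Enchant library through PyEnchant, together with NLTK, on Python 2.7. The code below is the class that handles the correction/replacement.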

import enchant
from nltk.metrics import edit_distance


class SpellingReplacer(object):
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        # Use the caller's threshold instead of a hard-coded 2.
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged.
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)

        # Accept the top suggestion only if it is close enough to the input.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

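I have written a function that takes a list of words, calls replace() on each word, and returns the list of those words, but spelled correctly.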

def spell_check(word_list):
    # Build the replacer once; constructing it for every word is wasted work.
    replacer = SpellingReplacer()
    checked_list = []
    for item in word_list:
        checked_list.append(replacer.replace(item))
    return checked_list

>>> word_list = ['car', 'colour'] 
>>> spell_check(word_list)
['car', 'color'] 

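I don't really like this because it isn't very accurate, and I'm looking for a better way to check and replace the spelling of words. I also need something that can handle misspellings like "caaaar". Are there better ways to perform spell checking? If so, what are they? How does Google do it, given that their spelling suggester is very good? Any suggestions are welcome.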

Answers

17

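I recommend starting by carefully reading this post by Peter Norvig, "How to Write a Spelling Corrector". (I had to do something similar and found it extremely useful.)

The following function in particular contains the ideas you now need to make your spell checker more sophisticated: splitting, deleting, transposing, replacing, and inserting letters in the irregular word to "correct" it.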

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Every way to cut the word into a (head, tail) pair.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    # All strings that are exactly one edit away from the input word.
    return set(deletes + transposes + replaces + inserts)

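Note: the above is one snippet from Norvig's spelling corrector.

The good news is that you can add to it incrementally and keep improving your spell checker.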
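
As one example of such an incremental step, here is a minimal sketch (not Norvig's full code) that ranks the strings produced by the edit function against a word-frequency table. `WORD_COUNTS` is a made-up toy table and `correct`, `known`, and `edits2` are illustrative names; a real checker would build the counts from a large corpus.

```python
import string

ALPHABET = string.ascii_lowercase

# Hypothetical toy frequencies; replace with counts from a real corpus.
WORD_COUNTS = {'car': 50, 'care': 20, 'the': 500, 'message': 30, 'colour': 10}


def edits1(word):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Return every string that is two edits away from `word`."""
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))


def known(words, word_counts):
    """Keep only the strings that appear in the frequency table."""
    return set(w for w in words if w in word_counts)


def correct(word, word_counts):
    """Pick the most frequent known word within two edits, else keep `word`."""
    if word in word_counts:
        return word
    candidates = (known(edits1(word), word_counts)
                  or known(edits2(word), word_counts)
                  or [word])
    return max(candidates, key=lambda w: word_counts.get(w, 0))


print(correct('hte', WORD_COUNTS))     # one transposition away from 'the'
print(correct('caaar', WORD_COUNTS))   # two deletions away from 'car'
print(correct('caaaar', WORD_COUNTS))  # three deletions away: left unchanged
```

Note that "caaaar" is three deletions away from "car", so a search limited to two edits leaves it unchanged; that kind of error needs a separate rule or a larger search radius.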

Hope that helps.

0

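Spell corrector:

You need to save a word list (corpus) to your desktop; if you store it somewhere else, change the path in the code. I have also added a small GUI using Tkinter. Note that this only tackles non-word errors!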

def min_edit_dist(word1, word2):
    len_1 = len(word1)
    len_2 = len(word2)
    # x[i][j] holds the edit distance between word1[:i] and word2[:j].
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    # Base cases: distance to or from the empty prefix.
    for i in range(len_1 + 1):
        x[i][0] = i
    for j in range(len_2 + 1):
        x[0][j] = j
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    # The bottom-right cell is the distance between the full words.
    return x[len_1][len_2]

from Tkinter import *  # Python 2 module name; it is "tkinter" on Python 3


def retrieve_text():
    global word1
    word1 = app_entry.get()
    # Raw string so the backslashes in the Windows path are kept literally.
    path = r"C:\Documents and Settings\Owner\Desktop\Dictionary.txt"
    ffile = open(path, 'r')
    # Strip the trailing newline so it does not count towards the distance.
    lines = [line.strip() for line in ffile]
    ffile.close()
    print("Suggestions coming right up count till 10")
    # Loop over the words actually read instead of a hard-coded 58109.
    for candidate in lines:
        if min_edit_dist(word1, candidate) <= 2:
            print(candidate)
            print(" ")

if __name__ == "__main__": 
    app_win = Tk() 
    app_win.title("spell") 
    app_label = Label(app_win, text="Enter the incorrect word") 
    app_label.pack() 
    app_entry = Entry(app_win) 
    app_entry.pack() 
    app_button = Button(app_win, text="Get Suggestions", command=retrieve_text) 
    app_button.pack() 
    # Initialize GUI loop 
    app_win.mainloop() 
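
The search itself can also be tried without the GUI or the dictionary file. The following is a minimal sketch under that assumption: `suggest` and `edit_distance` are names made up for this example, the word list is passed in as an ordinary list, and the distance is computed with a rolling row instead of a full matrix.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, keeping only one row in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest(word, dictionary_words, max_dist=2):
    """Return dictionary entries within max_dist edits, closest first."""
    scored = []
    for entry in dictionary_words:
        candidate = entry.strip()  # drop the newline left by readlines()
        if not candidate:
            continue
        distance = edit_distance(word, candidate)
        if distance <= max_dist:
            scored.append((distance, candidate))
    # Sort by distance, then alphabetically for ties.
    return [candidate for _, candidate in sorted(scored)]


# Hypothetical sample standing in for the lines of Dictionary.txt.
sample_lines = ['spelling\n', 'spewing\n', 'apple\n', 'spell\n']
print(suggest('speling', sample_lines))
```

Sorting by distance puts the closest matches first, which the print loop in the GUI version does not do.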
0

You can use the autocorrect library to do spell checking in Python.
Example usage:

from autocorrect import spell 

print spell('caaaar') 
print spell(u'mussage') 
print spell(u'survice') 
print spell(u'hte') 

Result:

caesar 
message 
service 
the
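
The "caesar" result for "caaaar" is probably not what was wanted. One heuristic for long letter runs, separate from the autocorrect library, is to squeeze repeated letters before looking the word up. This is only a sketch: `squeeze_repeats`, `candidate_forms`, and `first_known` are illustrative names, and `VOCABULARY` is a hypothetical word set standing in for a real dictionary.

```python
import re

# Hypothetical vocabulary; a real checker would load a full word list.
VOCABULARY = {'car', 'cool', 'book'}


def squeeze_repeats(word, max_run=2):
    """Shorten every run of one repeated character to at most max_run."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda match: match.group(1) * max_run, word)


def candidate_forms(word):
    """Return the word, then versions with runs capped at 2 and at 1."""
    forms = []
    for form in (word, squeeze_repeats(word, 2), squeeze_repeats(word, 1)):
        if form not in forms:
            forms.append(form)
    return forms


def first_known(word, vocabulary):
    """Return the first candidate form found in the vocabulary, else None."""
    for form in candidate_forms(word):
        if form in vocabulary:
            return form
    return None


print(candidate_forms('caaaar'))          # ['caaaar', 'caar', 'car']
print(first_known('caaaar', VOCABULARY))  # car
print(first_known('coool', VOCABULARY))   # cool
```

Trying the double-letter form before the single-letter form keeps words such as "cool" or "book" intact; anything still unknown after squeezing can be passed on to a regular corrector.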