What is a good strategy to group similar words?

Say I have a list of movie names with misspellings and small variations, like this:

"Pirates of the Caribbean: The Curse of the Black Pearl" 
"Pirates of the carribean" 
"Pirates of the Caribbean: Dead Man's Chest" 
"Pirates of the Caribbean trilogy" 
"Pirates of the Caribbean" 
"Pirates Of The Carribean"

How can I group or find such sets of words, preferably using Python and/or Redis?

Source

2011-07-05 abc def foo bar

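What result are you trying to get? Do you want to find all of these variants within a whole string? – JMax

I want to combine these into a single grouped object, and run a check when adding to the database. –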

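Have a look at "fuzzy matching". The thread linked below lists some great tools for computing the similarity between strings.

I'm especially fond of the difflib module: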

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
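To connect this to the movie titles in the question, here is a minimal sketch that groups titles with difflib's SequenceMatcher. The helper names (normalize, similarity, group_titles) and the 0.85 threshold are assumptions made for illustration, not part of the original answer; the grouping is a simple greedy pass that compares each title with the first member of every existing group.

```python
from difflib import SequenceMatcher

def normalize(title):
    # Lowercase and collapse whitespace so case differences do not count.
    return " ".join(title.lower().split())

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def group_titles(titles, threshold=0.85):
    # Greedy grouping: a title joins the first group whose representative
    # (the group's first member) is similar enough, else it starts a new group.
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups

titles = [
    "Pirates of the Caribbean",
    "Pirates of the carribean",
    "Pirates Of The Carribean",
    "The Matrix",
]
for group in group_titles(titles):
    print(group)
```

Whether longer titles such as the sequels land in the same group depends on the threshold and on which title happens to become the representative, so both are worth tuning against real data.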

Source

2011-07-05 07:41:07

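The linked question seems to have been deleted. It looks like – hardmooth

Once you reach a certain reputation score you can still see deleted questions, so I'm leaving the link as it is. –

@FredrikPihl could you please post the definition of 'get_close_matches' here (or edit it into the answer) for the unworthy low-reputation peasants? –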

To add another hint to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:

import re

# BASE_DIR, remove_tags() and dofind() are not shown in this post;
# they are assumed to be defined elsewhere in the search engine.

def dosearch(terms, searchtype, case, adddir, files=None):
    found = []
    titlesrch = re.compile('<title>.*</title>')
    for file in files or []:
        # Only HTML files are searched.
        if not file.lower().endswith(("html", "htm")):
            continue
        with open(BASE_DIR + adddir + file, 'r') as f:
            filecontents = f.read()
        # Extract the text between <title> and </title>, if present.
        title = ""
        titletmp = titlesrch.search(filecontents)
        if titletmp is not None:
            title = titletmp.group()[len('<title>'):-len('</title>')]
        # Search only the visible text, without markup or outer whitespace.
        filecontents = remove_tags(filecontents).strip()
        if dofind(filecontents, case, searchtype, terms) > 0:
            found.append(title)
            found.append(file)
    return found

Regards,

Max

Source

2011-07-05 07:50:10 JMax

I believe there are actually two different problems here.

The first is spelling correction. You can find one written in Python here:

http://norvig.com/spell-correct.html
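As a rough idea of what such a corrector looks like, here is a condensed sketch in the spirit of the linked article: generate every string within one or two edits of a word and keep the most frequent known candidate. The tiny vocabulary and the correct_title helper are assumptions for illustration; a real vocabulary would come from your own list of known titles or a larger corpus.

```python
from collections import Counter
import string

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from word.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words, counts):
    # Keep only the candidates that appear in the vocabulary.
    return {w for w in words if w in counts}

def correct(word, counts):
    # Prefer the word itself, then distance-1 candidates, then distance-2.
    word = word.lower()
    candidates = (known([word], counts)
                  or known(edits1(word), counts)
                  or known((e2 for e1 in edits1(word) for e2 in edits1(e1)), counts)
                  or {word})
    return max(candidates, key=lambda w: counts.get(w, 0))

def correct_title(title, counts):
    # Hypothetical helper: correct a title word by word (output is lowercased).
    return " ".join(correct(w, counts) for w in title.split())

# Toy vocabulary built from one correctly spelled title.
counts = Counter("pirates of the caribbean the curse of the black pearl".split())
print(correct_title("Pirates Of The Carribean", counts))
```

Words that have no known neighbour within two edits are returned unchanged, so unfamiliar titles pass through untouched.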

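The second is more about function. This is what I would do after the spelling correction: I would build a relation function.

related(sentence1, sentence2) if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than the very common ones (the, what, is, etc.). You can look at TF/IDF systems to determine whether two documents are related based on the words they use. After just a bit of googling, I found this: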

https://code.google.com/p/tfidf/
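The relation described above can be sketched in plain Python without the linked library. This version uses only the inverse document frequency part of TF/IDF to decide which words count as rare; the function names and the min_idf cutoff of 0.5 are assumptions for illustration and would need tuning on a real title list.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; punctuation such as ':' is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())

def idf_weights(documents):
    # idf(word) = log(N / number of documents containing the word).
    # Words found in most documents ("the", "of") get weights near zero.
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n / count) for word, count in doc_freq.items()}

def related(a, b, idf, min_idf=0.5):
    # Two sentences are related if they share at least one rare word.
    # Words never seen in the corpus are ignored (weight 0.0).
    shared = set(tokenize(a)) & set(tokenize(b))
    return any(idf.get(word, 0.0) >= min_idf for word in shared)

corpus = [
    "Pirates of the Caribbean",
    "Pirates of the Caribbean trilogy",
    "The Lord of the Rings",
    "The Matrix",
    "Lord of War",
    "Matrix Reloaded",
]
idf = idf_weights(corpus)
print(related(corpus[0], corpus[1], idf))
print(related(corpus[0], corpus[2], idf))
```

Because the comparison is on exact tokens, it works best after the spelling-correction step, so that "carribean" and "caribbean" are already the same word.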

Source

2011-07-05 10:38:08 yogsototh