Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

I have some text:

s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

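I would like to parse this into its individual words. I had a quick look at enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I would look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought something online would already do this; am I wrong?

Source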

2013-03-12 Erotemic

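You could encode a dictionary of words as a trie and use a greedy algorithm: pull off the longest word that matches, then move on to the next word, backtracking on failure. Probably not optimal. Try this for suggestions on data structures: http://kmike.ru/python-data-structures/ – hughdbrown 2013-03-12 15:14:55

Interesting question. I would guess the answer (to "an easy way") is "no". – 2013-03-12 15:15:03

A similar question was asked before without much luck: http://stackoverflow.com/questions/13034330/how-to-separate-an-engilsh-language-string-without-spaces-to-form-some-meaningfu – 2013-03-12 15:15:54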

Greedy approach using a trie

Try this using Biopython (pip install biopython):

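# Originally written for Python 2; Bio.trie is only available in older
# Biopython releases.
from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    """Load one word per line from dictfile into a trie keyed by word."""
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Under Python 2, entries with non-ASCII bytes raise here
                # and are skipped.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    """Return (longest dictionary word that is a prefix of s, rest of s)."""
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            print("unmatched: %s" % s)
            break
        print(word)


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)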

Result

>>> if __name__ == '__main__': 
...  s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:" 
...  s = s.strip(string.punctuation) 
...  s = s.replace(" ", '') 
...  s = s.lower() 
...  main(s) 
... 
image 
classification 
methods 
can 
be 
roughly 
divided 
into 
two 
broad 
families 
of 
approaches

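Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to handle those, but this should get you started.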
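To make the backtracking remark concrete, here is a minimal sketch, not part of the original answer. It uses a plain Python set as a stand-in for the dictionary; the function name `segment` and the three-word vocabulary are hypothetical, chosen so that the greedy longest match ("them") leads to a dead end.

```python
def segment(s, words):
    """Return a list of words from `words` that concatenate to s, or None."""
    if not s:
        return []
    # Try the longest prefix first, exactly as the greedy version does.
    for end in range(len(s), 0, -1):
        prefix = s[:end]
        if prefix in words:
            rest = segment(s[end:], words)
            if rest is not None:
                return [prefix] + rest
            # Dead end: fall through and try a shorter prefix (backtrack).
    return None


# Hypothetical toy vocabulary: greedy picks "them" and then gets stuck on "en".
vocabulary = {"the", "them", "men"}
print(segment("themen", vocabulary))
```

Without memoization this can take exponential time on adversarial input, so it is only a starting point.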

Obligatory test

>>> main("expertsexchange") 
experts 
exchange
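For readers who cannot install a Biopython release that includes Bio.trie, the same greedy longest-match idea can be sketched with a small dict-based trie. This is an illustrative stand-in rather than the Bio.trie API: the `Trie` class, its `longest_prefix` method, `greedy_split`, and the tiny word list are all assumptions made for the example.

```python
END = ""  # sentinel key marking "a word ends here"; never a real character


class Trie:
    """Minimal trie built from nested dicts (hypothetical helper)."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def longest_prefix(self, s):
        """Return the longest stored word that is a prefix of s ('' if none)."""
        node = self.root
        best = 0
        for i, ch in enumerate(s):
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                best = i + 1
        return s[:best]


def greedy_split(s, trie):
    """Greedily peel off longest matches; return (words, unmatched remainder)."""
    found = []
    while s:
        word = trie.longest_prefix(s)
        if not word:
            break
        found.append(word)
        s = s[len(word):]
    return found, s


toy = Trie(["expert", "experts", "exchange", "sex", "change"])
print(greedy_split("expertsexchange", toy))
```

Walking the trie finds the longest match in a single pass over the input, instead of testing every prefix separately.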

Source

2013-03-12 17:00:50 hughdbrown

Brilliant. This is exactly what I was looking for! – Erotemic 2013-03-14 15:44:25

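This is the kind of problem that comes up often in Asian-language NLP. If you have a dictionary, you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

Note that the search space may be very large, because alphabetic English words surely contain more characters than syllabic Chinese/Japanese words do.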
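One common way to keep that search space in check is to memoize on the start position, so each suffix of the input is solved at most once. The sketch below is an assumption-laden illustration, not mini-segmenter's implementation: `segment_memo` is a hypothetical name and the dictionary is a plain Python set.

```python
from functools import lru_cache


def segment_memo(s, words):
    """Return a tuple of words from `words` that concatenate to s, or None."""
    max_len = max(map(len, words)) if words else 0

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(s):
            return ()
        # Only look at candidates no longer than the longest dictionary word.
        for end in range(min(len(s), start + max_len), start, -1):
            candidate = s[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # cached, so this dead end is never explored again

    return solve(0)


# A string that forces plain backtracking to retry the same suffixes repeatedly.
print(segment_memo("a" * 30 + "b", {"a", "aa", "aaa"}))
```

With the cache, the work is bounded by roughly the input length times the longest word length. Very long inputs may still hit Python's recursion limit, in which case an iterative table would be the safer choice.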

Source

2013-03-13 22:25:32 alvas
