How can I get phonemes for any word from Python NLTK or another module?

Python NLTK has cmudict, which returns the phonemes of the words it recognizes, for example 'see' -> [u'S', u'IY1']. For words it does not recognize it raises an error, for example 'seasee' -> error.

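import nltk

# Requires the CMU dictionary corpus: nltk.download('cmudict')
arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea'):
    try:
        # First listed pronunciation of the word
        print(arpabet[word][0])
    except KeyError as e:
        # Words missing from cmudict raise KeyError
        print(e)

# Output (from Python 2; Python 3 prints the same lists without the u prefix)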
[u'EH1', u'S'] 
[u'S', u'IY1'] 
[u'S', u'IY1'] 
[u'K', u'AH0', u'M', u'P', u'Y', u'UW1', u'T'] 
'comput' 
'seesea'

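Is there any module that does not have this limitation and can find or guess the phonemes of any word, real or made up?

If not, is there a way to program it myself? I am thinking of looping over progressively longer prefixes of the word. For example, with 'seasee' the first iteration takes 's', the next takes 'se', the third takes 'sea', and so on, looking each one up in cmudict. The problem is that I do not know how to decide which phonemes are the right ones to keep. For example, both 's' and 'sea' in 'seasee' produce valid phonemes.

Work in progress: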

import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea', 'darfasasawwa'):
    try:
        phone = arpabet[word][0]
    except KeyError:
        # Unknown word: look up every prefix, from shortest to longest
        for end in range(1, len(word) + 1):
            substring = word[:end]
            try:
                print(substring, arpabet[substring][0])
            except KeyError as e:
                # Prefix not in cmudict either: show the missing key
                print(substring, e)

# Output (from Python 2; Python 3 prints the same lists without the u prefix)
c [u'S', u'IY1'] 
co [u'K', u'OW1'] 
com [u'K', u'AA1', u'M'] 
comp [u'K', u'AA1', u'M', u'P'] 
compu [u'K', u'AA1', u'M', u'P', u'Y', u'UW0'] 
comput 'comput' 
s [u'EH1', u'S'] 
se [u'S', u'AW2', u'TH', u'IY1', u'S', u'T'] 
see [u'S', u'IY1'] 
sees [u'S', u'IY1', u'Z'] 
seese [u'S', u'IY1', u'Z'] 
seesea 'seesea' 
d [u'D', u'IY1'] 
da [u'D', u'AA1'] 
dar [u'D', u'AA1', u'R'] 
darf 'darf' 
darfa 'darfa' 
darfas 'darfas' 
darfasa 'darfasa' 
darfasas 'darfasas' 
darfasasa 'darfasasa' 
darfasasaw 'darfasasaw' 
darfasasaww 'darfasasaww' 
darfasasawwa 'darfasasawwa'
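One way to turn the prefix loop above into a decision rule is greedy longest-match: at each position, keep the longest prefix the dictionary knows and continue from where it ends. The following is only a minimal sketch under assumptions: `TOY_ARPABET` is a small hand-made stand-in for `nltk.corpus.cmudict.dict()`, and `longest_prefix_phones` is a hypothetical helper name, not part of NLTK.

```python
# Hand-made stand-in for cmudict (assumption): word -> list of pronunciations.
TOY_ARPABET = {
    "s": [["EH1", "S"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
}


def longest_prefix_phones(word, lexicon):
    """Greedily split word into the longest known chunks.

    Returns the concatenated phonemes (first pronunciation of each chunk),
    or None if some remainder has no known prefix.
    """
    word = word.lower()
    phones = []
    start = 0
    while start < len(word):
        # Try the longest candidate chunk first, then shorter ones.
        for end in range(len(word), start, -1):
            chunk = word[start:end]
            if chunk in lexicon:
                phones.extend(lexicon[chunk][0])
                start = end
                break
        else:
            # No prefix of the remainder is in the lexicon.
            return None
    return phones


print(longest_prefix_phones("seesea", TOY_ARPABET))  # ['S', 'IY1', 'S', 'IY1']
print(longest_prefix_phones("darf", TOY_ARPABET))    # None
```

Greedy matching never backtracks, so it can commit to a long prefix and then get stuck: if 'sees' were added to the toy dictionary, 'seesea' would be split as 'sees' + 'ea' and the function would return None. Handling that case requires trying alternative split points.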

2015-11-12 KubiK888

You can use the LOGIOS Lexicon Tool. Here is an example of its output:

S EH S 
SEE S IY 
SEA S IY 
COMPUTE K AH M P Y UW T 
COMPUT K AH M P UH T 
SEESEA S IY S IY

I do not know of any Python implementation. You can try implementing it yourself, or call the perl code using subprocess.call.
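A rough sketch of the subprocess route follows. It assumes the tool's result is available as text lines shaped like the sample output above (a word followed by its phonemes). The command used here is only a placeholder, a Python one-liner that prints two sample lines, and not the real invocation of the LOGIOS script, whose arguments are not shown in this thread. `run_tool` and `parse_lexicon` are hypothetical helper names.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run an external command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_lexicon(text):
    """Parse lines of the form 'WORD PH1 PH2 ...' into {word: [phonemes]}."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        lexicon[parts[0].lower()] = parts[1:]
    return lexicon


# Placeholder command (assumption): replace with the real tool invocation.
demo_cmd = [
    sys.executable,
    "-c",
    "print('SEE S IY'); print('COMPUT K AH M P UH T')",
]
lexicon = parse_lexicon(run_tool(demo_cmd))
print(lexicon["comput"])  # ['K', 'AH', 'M', 'P', 'UH', 'T']
```

If the real tool writes its dictionary to a file instead of standard output, the same `parse_lexicon` function can be applied to the file contents.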

2015-11-12 07:59:18 dimid

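Thanks, but is there a way to get this done in Python? Everything I have is in Python, and I would like to automate as much as possible within the script. – KubiK888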

I do not know of any Python implementation. You can try implementing it yourself, or call the [perl code](https://github.com/skerit/cmusphinx/blob/master/logios/Tools/MakeDict/make_pronunciation.pl) using 'subprocess.call'. I have edited the answer. – dimid

You could also use jython (I have not tested it): http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/scripts/word2phones.py – dimid

I ran into the same problem and solved it by splitting the unknown word recursively (see wordbreak):

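import nltk
from functools import lru_cache
from itertools import product as iterprod

try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    # Corpus not installed yet: download it once, then load it
    nltk.download('cmudict')
    arpabet = nltk.corpus.cmudict.dict()

@lru_cache()
def wordbreak(s):
    """Return a list of pronunciations for s, or None if no split works."""
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    middle = len(s) / 2
    # Try split points nearest the middle first, with a bias toward longer prefixes
    partition = sorted(range(1, len(s)), key=lambda x: (x - middle) ** 2 - x)
    for i in partition:
        pre, suf = s[:i], s[i:]
        if pre in arpabet:
            rest = wordbreak(suf)
            if rest is not None:
                # Pair every pronunciation of the prefix with every one of the suffix
                return [x + y for x, y in iterprod(arpabet[pre], rest)]
    return None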
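To see how the recursion behaves without downloading cmudict, here is a self-contained sketch of the same idea run against a small hand-made dictionary. `TOY_ARPABET` and `toy_wordbreak` are names made up for this illustration, and the entries are only meant to mimic the shape of `nltk.corpus.cmudict.dict()` (word mapped to a list of pronunciations).

```python
from functools import lru_cache
from itertools import product

# Hand-made stand-in for cmudict (assumption): word -> list of pronunciations.
TOY_ARPABET = {
    "s": [["EH1", "S"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
    "a": [["AH0"], ["EY1"]],
}


@lru_cache(maxsize=None)
def toy_wordbreak(s):
    """Return all pronunciations found for s, or None if no split works."""
    s = s.lower()
    if s in TOY_ARPABET:
        return TOY_ARPABET[s]
    middle = len(s) / 2
    # Candidate split points, nearest the middle first.
    for i in sorted(range(1, len(s)), key=lambda x: (x - middle) ** 2 - x):
        pre, suf = s[:i], s[i:]
        if pre in TOY_ARPABET:
            rest = toy_wordbreak(suf)  # recurse on the remainder
            if rest is not None:
                # Pair each prefix pronunciation with each suffix pronunciation.
                return [x + y for x, y in product(TOY_ARPABET[pre], rest)]
    return None


print(toy_wordbreak("seesea"))  # [['S', 'IY1', 'S', 'IY1']]
print(toy_wordbreak("seea"))    # [['S', 'IY1', 'AH0'], ['S', 'IY1', 'EY1']]
print(toy_wordbreak("xyz"))     # None
```

The result is a list of pronunciations, so a word whose parts have several variants (like 'a' here) yields several candidates. The function returns the first split that works, which is not necessarily the most natural one.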

2017-10-29 11:37:18
