Why does NLTK lemmatization give the wrong output even though verb.exc contains the right value?

When I open verb.exc, I can see this entry:

saw see

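But when I run the lemmatizer in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> print(lmtzr.lemmatize('saw', 'v'))
saw

How can this happen? Am I misunderstanding something about how WordNet is revised?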

Source

2015-11-08 Leo Hsieh

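In short:

It's a somewhat strange case involving the exception lists.

There is also the case of "I saw the log into half.", where "saw" is a present-tense verb.

See @nschneid's suggestion to use more fine-grained tags in the issue that was raised: https://github.com/nltk/nltk/issues/1196

In long:

If we look at how we call the WordNet lemmatizer in NLTK: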

>>> from nltk.stem import WordNetLemmatizer 
>>> wnl = WordNetLemmatizer() 
>>> wnl.lemmatize('saw', pos='v') 
'saw' 
>>> wnl.lemmatize('saw') 
'saw'

Specifying the POS tag seems to make no difference. Let's take a look at the lemmatizer code itself:

class WordNetLemmatizer(object):
    def __init__(self):
        pass

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word

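What it does is rely on the _morphy method of the WordNet corpus reader to return the possible lemmas.

If we thread through the nltk.corpus.wordnet code, we can see the _morphy() code at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1679

The first few lines of the function read the exceptions from WordNet's verb.exc file, i.e. https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1687

So if we do an ad-hoc lookup of the exceptions outside of the lemmatizer function, we also see that 'saw' -> 'see':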

>>> from nltk.corpus import wordnet as wn 
>>> exceptions = wn._exception_map['v'] 
>>> exceptions['saw'] 
[u'see']
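For intuition about where that mapping comes from: each line of a WordNet exception file such as verb.exc is whitespace-separated, with the inflected form first and one or more base forms after it (the "saw see" entry in the question is one such line). Here is a minimal sketch of parsing that format. The `parse_exceptions` helper and the inline sample text are made up for illustration; this is not NLTK's own loader.

```python
def parse_exceptions(text):
    """Map each inflected form to its list of base forms."""
    mapping = {}
    for line in text.splitlines():
        terms = line.split()
        if len(terms) < 2:
            continue  # skip blank or malformed lines
        mapping[terms[0]] = terms[1:]
    return mapping


# Inline sample standing in for the contents of verb.exc (assumption).
sample = "saw see\nate eat\n"
exceptions = parse_exceptions(sample)
print(exceptions["saw"])
```

Keeping the base forms as a list matters, because one inflected form can map to more than one base form.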

So if we call the _morphy() function outside of the lemmatizer:

>>> from nltk.corpus import wordnet as wn 
>>> exceptions = wn._exception_map['v'] 
>>> wn._morphy('saw', 'v') 
['saw', u'see']

Let's go back to the return line of the WordNetLemmatizer.lemmatize() code, where we see return min(lemmas, key=len) if lemmas else word:

def lemmatize(self, word, pos=NOUN): 
    lemmas = wordnet._morphy(word, pos) 
    return min(lemmas, key=len) if lemmas else word

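That means the function returns the shortest item in the output of wn._morphy(). But in this case saw and see have the same length, so the first item in the list returned by wn._morphy() is the one returned, i.e. saw.

Effectively, WordNetLemmatizer.lemmatize() is doing this: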

>>> from nltk.corpus import wordnet as wn 
>>> wn._morphy('saw', 'v') 
['saw', u'see'] 
>>> min(wn._morphy('saw', 'v'), key=len) 
'saw'
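The tie-breaking can be checked without NLTK at all, since it is just the behaviour of Python's built-in min(): when several items share the smallest key, the first one encountered wins. A small sketch using a hand-written candidate list in place of the wn._morphy() output:

```python
# Hand-written stand-in for the list returned by wn._morphy('saw', 'v').
candidates = ["saw", "see"]

# Both strings have length 3, so min() keeps the first one it sees.
first_wins = min(candidates, key=len)

# Reversing the order flips the result, which shows the choice is
# driven purely by list order and not by anything linguistic.
reversed_wins = min(reversed(candidates), key=len)

print(first_wins)
print(reversed_wins)
```

So the lemma returned for an equal-length tie depends entirely on the order in which _morphy() lists its candidates.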

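So the questions are:

How can I avoid this "bug" in NLTK?
How can this "bug" in NLTK be fixed?

But do note that it is not exactly a "bug" but a "feature" that represents the other possible lemmas of a surface word (although the word is rare in that specific sense, e.g. "I saw the log into half").

How can I avoid this "bug" in NLTK?

To avoid this "bug" in NLTK, use nltk.corpus.wordnet._morphy() instead of nltk.stem.WordNetLemmatizer.lemmatize(). That way you always get the list of possible lemmas rather than a single lemma filtered by length. For example: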

>>> from nltk.corpus import wordnet as wn 
>>> exceptions = wn._exception_map['v'] 
>>> wn._morphy('saw', pos='v') 
['saw', 'see']

Having more choices is better than having one wrong choice.
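Once you have the full candidate list, something still has to pick one lemma. One possible heuristic, in the spirit of the finer-grained tags suggested in the GitHub issue linked above, is to use a Penn Treebank tag to decide whether the surface word is inflected. The `choose_lemma` function and its tag set below are assumptions made for illustration; they are not part of NLTK and not necessarily what the issue proposes.

```python
# Tags treated as "inflected verb" for this sketch (an assumption).
INFLECTED_VERB_TAGS = {"VBD", "VBN", "VBG", "VBZ"}


def choose_lemma(word, candidates, penn_tag):
    """Pick one lemma from a candidate list using a fine-grained tag."""
    if not candidates:
        return word  # nothing known about this word, keep it as-is
    if penn_tag in INFLECTED_VERB_TAGS:
        # An inflected form should not be its own lemma, so prefer
        # the first candidate that differs from the surface word.
        for lemma in candidates:
            if lemma != word:
                return lemma
    # Base-form tags (VB, VBP) or no alternative: same rule as lemmatize().
    return min(candidates, key=len)


# Hand-written stand-in for wn._morphy('saw', 'v').
print(choose_lemma("saw", ["saw", "see"], "VBD"))  # past tense of "see"
print(choose_lemma("saw", ["saw", "see"], "VBP"))  # present tense of "to saw"
```

This only helps if the tagger gets the tense right, so it moves the ambiguity into the tagging step rather than removing it.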

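How can this "bug" in NLTK be fixed?

Beyond min(lemmas, key=len) being sub-optimal, the _morphy() function is a little inconsistent when dealing with exceptions, because a plural word with a rare meaning can be a lemma in its own right, e.g. teeth used to refer to dentures; see http://wordnetweb.princeton.edu/perl/webwn?s=teeth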

>>> wn._morphy('teeth', 'n') 
['teeth', u'tooth'] 
>>> wn._morphy('goose', 'n') 
['goose'] 
>>> wn._morphy('geese', 'n') 
[u'goose']
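The three outputs above are consistent with a lookup that puts the surface word first, appends its exception entries, and then keeps only the forms that are themselves known lemmas. The toy sketch below reproduces that pattern with hand-built data. `KNOWN_LEMMAS`, `EXCEPTIONS` and `toy_morphy` are hypothetical, ignore POS and suffix rules, and are not NLTK's actual code.

```python
# Hypothetical toy data, not WordNet's real lemma inventory or .exc files.
KNOWN_LEMMAS = {"teeth", "tooth", "goose"}
EXCEPTIONS = {"teeth": ["tooth"], "geese": ["goose"]}


def toy_morphy(word):
    """Surface word first, then exception entries, known lemmas only."""
    forms = [word] + EXCEPTIONS.get(word, [])
    result = []
    for form in forms:
        # Drop forms that are not lemmas, and avoid duplicates.
        if form in KNOWN_LEMMAS and form not in result:
            result.append(form)
    return result


for surface in ("teeth", "goose", "geese"):
    print(surface, toy_morphy(surface))
```

Under this reading, teeth keeps itself in the list because it is a lemma in its own right, while geese drops out because it is not, which is why only the first case is ambiguous.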

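So the error in lemma choice must be introduced in the nltk.corpus.wordnet._morphy() function after the exception list is consulted. One quick hack is to return the exception list's entry immediately if the input surface word occurs in the exception list, e.g.: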

from nltk.corpus import wordnet as wn 
def _morphy(word, pos): 
    exceptions = wn._exception_map[pos] 
    if word in exceptions: 
        return exceptions[word]

    # Else, continue the rest of the _morphy code.
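To see the effect of that hack end to end without patching NLTK, here is a self-contained sketch that takes the exception map and the fallback function as arguments. `lemmatize_exceptions_first`, `toy_exception_map` and `toy_morphy` are hypothetical names with hand-written data; only the idea of checking the exceptions first comes from the snippet above.

```python
def lemmatize_exceptions_first(word, pos, exception_map, morphy):
    """Return the first exception entry if one exists, else fall back."""
    exceptions = exception_map.get(pos, {})
    if word in exceptions and exceptions[word]:
        return exceptions[word][0]
    # No exception entry: behave like WordNetLemmatizer.lemmatize().
    lemmas = morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word


# Stand-ins for wn._exception_map and wn._morphy (toy data, assumption).
toy_exception_map = {"v": {"saw": ["see"]}, "n": {"teeth": ["tooth"]}}


def toy_morphy(word, pos):
    table = {("saw", "v"): ["saw", "see"], ("dogs", "n"): ["dog"]}
    return table.get((word, pos), [])


print(lemmatize_exceptions_first("saw", "v", toy_exception_map, toy_morphy))
print(lemmatize_exceptions_first("dogs", "n", toy_exception_map, toy_morphy))
print(lemmatize_exceptions_first("teeth", "n", toy_exception_map, toy_morphy))
```

With NLTK installed, the same function could presumably be called with wn._exception_map and wn._morphy, both of which are private names used earlier in this answer. The trade-off is that rarer readings are lost: teeth always becomes tooth, and saw as a present-tense verb always becomes see.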

Source

2015-11-08 23:24:55 alvas

Thx @alvas, your answer is amazing. I will check whether there are any wrong exception words and get back to you.

Glad the answer helps =) – alvas
