2014-04-10 35 views
2

Python NLTK lemmatization of the word 'further' using WordNet

I am building a lemmatizer with Python, NLTK, and the WordNetLemmatizer. Here is a random snippet that outputs what I expect:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb

Output: 'worse'

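Well, everything is fine here. The behavior is the same for other adjectives such as 'better' (an irregular form) or 'older'. (Note that the same test with 'elder' never outputs 'old', but I guess WordNet is not an exhaustive list of all existing English words.)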

My question arises when I try the word 'further':

lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'

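This is exactly the opposite of the behavior for the word 'worse'!

Can anybody explain why? Is it a bug in the WordNet synsets data, or does it come from my misunderstanding of English grammar?

Please excuse me if this has already been answered. I searched Google and SO, but with 'further' as the keyword I find everything except relevant results, because the word is so common...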

Thanks in advance, Romain G.

+0

Nice catch on this peculiarity =) – alvas

Answer

5

WordNetLemmatizer uses the ._morphy function to look up the lemmas of a word and returns the shortest possible lemma; from http://www.nltk.org/_modules/nltk/stem/wordnet.html:

def lemmatize(self, word, pos=NOUN): 
    lemmas = wordnet._morphy(word, pos) 
    return min(lemmas, key=len) if lemmas else word 
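To see what that last line does on its own, here is a minimal sketch that needs no NLTK install. The name pick_lemma and the candidate lists are made up for illustration; in the real code the candidates come from wordnet._morphy.

```python
def pick_lemma(word, candidates):
    """Mirror the selection step: shortest candidate, else the word itself."""
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, standing in for what _morphy might return.
print(pick_lemma("further", ["further", "far"]))  # shortest candidate wins
print(pick_lemma("further", ["further"]))         # only one candidate
print(pick_lemma("xyzzy", []))                    # no candidates: word comes back unchanged
```

So whenever the original form and a shorter form are both candidates, the shorter one is what you get back.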

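The ._morphy function applies rules iteratively to get a lemma; the rules keep shortening the word by replacing its suffixes according to MORPHOLOGICAL_SUBSTITUTIONS. It then checks whether any of the shorter candidate forms exists in the database: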

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    # find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
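The rule-based part (steps 1 to 3) can be tried out without WordNet. This is only a rough sketch: the suffix table is what I believe NLTK uses for adjectives, so treat it as an assumption, and TOY_LEXICON is a made-up stand-in for the WordNet database. The exception-list check (step 0) is left out on purpose.

```python
# Assumed adjective suffix rules as (old suffix, replacement) pairs.
ADJ_RULES = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]

# Hypothetical stand-in for the set of adjectives known to the database.
TOY_LEXICON = {"old", "wide", "further"}


def apply_rules(forms, rules=ADJ_RULES):
    """Strip or replace a matching suffix on every form."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in rules
            if form.endswith(old)]


def known(forms, lexicon=TOY_LEXICON):
    """Keep forms found in the lexicon, in order, without duplicates."""
    result = []
    for form in forms:
        if form in lexicon and form not in result:
            result.append(form)
    return result


def toy_morphy(form):
    """Rule-based lookup only; no exception list."""
    forms = apply_rules([form])
    results = known([form] + forms)
    if results:
        return results
    while forms:
        forms = apply_rules(forms)
        results = known(forms)
        if results:
            return results
    return []


print(toy_morphy("older"))    # ['old']
print(toy_morphy("widest"))   # ['wide']
print(toy_morphy("further"))  # ['further']
```

Note that 'further' only yields 'furth' and 'furthe' under these rules, neither of which is a word, so suffix stripping alone can never reach 'far'.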

However, if the word is in the exception list, it returns a fixed value stored in exceptions; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
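The parsing itself is just a whitespace split: the first term on each line is the inflected form and the rest are its lemmas. A small sketch of that step on in-memory lines, where parse_exceptions is a hypothetical helper name and skipping blank lines is my own addition:

```python
def parse_exceptions(lines):
    """Map each inflected form to its list of lemmas, one entry per line."""
    mapping = {}
    for line in lines:
        terms = line.split()
        if not terms:  # skip blank lines (an extra guard, not in the NLTK code)
            continue
        mapping[terms[0]] = terms[1:]
    return mapping


# Sample lines in the same "form lemma" layout as the .exc files.
sample = ["farther far ", "further far ", ""]
print(parse_exceptions(sample))
```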

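Going back to your example, worse -> bad and further -> far cannot be derived from the rules, so they must come from the exception lists. And since these are hand-maintained exception lists, there are bound to be inconsistencies.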

The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc

adv.exc

best well 
better well 
deeper deeply 
farther far 
further far 
harder hard 
hardest hard 

adj.exc

... 
worldliest worldly 
wormier wormy 
wormiest wormy 
worse bad 
worst bad 
worthier worthy 
worthiest worthy 
wrier wry 
... 
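Putting the two excerpts together explains the asymmetry from the question. The sketch below is a toy model, not the NLTK code: it keeps only the two entries that matter, assumes every candidate exists in the database, and assumes that 'further' is missing from adj.exc and 'worse' from adv.exc, which is what the outputs in the question suggest.

```python
# 'a' and 'r' are the WordNet POS tags for adjective and adverb.
EXCEPTIONS = {
    "a": {"worse": ["bad"]},    # entry from adj.exc
    "r": {"further": ["far"]},  # entry from adv.exc
}


def toy_lemmatize(word, pos):
    """Exception lookup per POS, then pick the shortest candidate."""
    candidates = [word] + EXCEPTIONS[pos].get(word, [])
    return min(candidates, key=len)


for word in ("worse", "further"):
    for pos in ("a", "r"):
        print(word, pos, toy_lemmatize(word, pos))
```

Each word only gets its irregular lemma under the part of speech whose exception file lists it; under the other one it falls through unchanged.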
+0

So, following what you said, I edited the file 'adj.exc' and added a line: 'further far'. The result: 'lem.lemmatize('further', pos=wordnet.ADJ)' now outputs 'far'. Very nice, thanks a lot, this is a great answer! –

+1

Heh, one small correction: in my case (Mac OS X), the exception lists are stored in '~/nltk_data/corpora/wordnet/*.exc' (rather than '~/nltk_data/wordnet/*.exc'). –
