2017-06-05 85 views
0

I am trying to lemmatize some words from the Quran, but some of them are not lemmatized. Why can't the NLTK lemmatizer lemmatize some plural words?

Here is my sentence:

sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful" 

That sentence is part of my txt dataset. As you can see, it contains "surahs", which is the plural form of "surah". Here is the code I tried:

from nltk.stem import WordNetLemmatizer

def lemmatize(self, ayat):
    # Lemmatize each token in ayat, treating every token as a verb ('v')
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []
    for word in ayat:
        result.append(wordnet_lemmatizer.lemmatize(word, 'v'))
    return result

When I run it and print the result, I get this:

['bring', 'ten', 'surahs', 'like', u'invent', 'call', 'upon', 'assistance', 'whomever', 'besides', 'Allah', 'truthful'] 

The word "surahs" is not changed to "surah".

Can anyone explain why? Thank you.

+0

There is nothing wrong with the WordNetLemmatizer itself; it just doesn't handle irregular words well enough. You could try this "hack": https://stackoverflow.com/questions/22333392/stemming-some-plurals-with-wordnet-lemmatizer-doesnt-work

+0

I tried that hack, but it returns nothing, just [] – sang

Answers

1

For most non-standard English words, the WordNet lemmatizer is not much help in getting the correct lemma. Try a stemmer instead:

>>> from nltk.stem import PorterStemmer 
>>> porter = PorterStemmer() 
>>> porter.stem('surahs') 
u'surah' 
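
A stemmer applies its suffix rules to every word, so it can also shorten words you wanted left alone. If only a handful of domain terms such as "surahs" are affected, one alternative is a small override table that is consulted before the lemmatizer. The following is only a sketch under stated assumptions: `DOMAIN_LEMMAS`, `lemmatize_with_overrides`, and the stand-in `identity` function are hypothetical names, not part of NLTK. In practice you would pass something like `WordNetLemmatizer().lemmatize` as `base_lemmatize`.

```python
# Hypothetical override table for terms the base lemmatizer does not know.
DOMAIN_LEMMAS = {"surahs": "surah"}


def lemmatize_with_overrides(tokens, base_lemmatize, overrides=DOMAIN_LEMMAS):
    """Lemmatize tokens, preferring an explicit override when one exists."""
    result = []
    for token in tokens:
        key = token.lower()  # overrides are matched case-insensitively
        if key in overrides:
            result.append(overrides[key])
        else:
            result.append(base_lemmatize(token))
    return result


def identity(word):
    # Stand-in for a real lemmatizer so the sketch runs without NLTK data;
    # it returns every word unchanged.
    return word


tokens = ["bring", "ten", "surahs", "like", "Allah"]
lemmas = lemmatize_with_overrides(tokens, identity)
print(lemmas)
```

The table only covers the words you list, so it suits a small, known vocabulary rather than open-ended text.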

Also, try `lemmatize_sent` from `earthy` (an NLTK wrapper, "shameless plug"):

>>> from earthy.nltk_wrappers import lemmatize_sent 
>>> sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful" 
>>> lemmatize_sent(sentence) 
[('Then', 'Then', 'RB'), ('bring', 'bring', 'VBG'), ('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'), ('like', 'like', 'IN'), ('it', 'it', 'PRP'), ('that', 'that', 'WDT'), ('have', 'have', 'VBP'), ('been', u'be', 'VBN'), ('invented', u'invent', 'VBN'), ('and', 'and', 'CC'), ('call', 'call', 'VB'), ('upon', 'upon', 'NN'), ('for', 'for', 'IN'), ('assistance', 'assistance', 'NN'), ('whomever', 'whomever', 'NN'), ('you', 'you', 'PRP'), ('can', 'can', 'MD'), ('besides', 'besides', 'VB'), ('Allah', 'Allah', 'NNP'), ('if', 'if', 'IN'), ('you', 'you', 'PRP'), ('should', 'should', 'MD'), ('be', 'be', 'VB'), ('truthful', 'truthful', 'JJ')] 

>>> words, lemmas, tags = zip(*lemmatize_sent(sentence)) 
>>> lemmas 
('Then', 'bring', 'ten', 'surahs', 'like', 'it', 'that', 'have', u'be', u'invent', 'and', 'call', 'upon', 'for', 'assistance', 'whomever', 'you', 'can', 'besides', 'Allah', 'if', 'you', 'should', 'be', 'truthful') 

>>> from earthy.nltk_wrappers import pywsd_lemmatize 
>>> pywsd_lemmatize('surahs') 
'surahs' 

>>> from earthy.nltk_wrappers import porter_stem 
>>> porter_stem('surahs') 
u'surah' 
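
The triples above carry Penn Treebank tags, whereas the code in the question passes 'v' for every word. A common pattern is to map each tag to one of WordNet's part-of-speech letters and hand that to the lemmatizer. Here is a minimal sketch: `penn_to_wordnet` is a hypothetical helper name, and falling back to noun for unrecognized tags is an assumption. It will not help with "surahs" if "surah" is missing from WordNet, as the outputs above suggest.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r', 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective tags such as JJ
    if tag.startswith("V"):
        return "v"  # verb tags such as VB, VBN
    if tag.startswith("R"):
        return "r"  # tags starting with R, such as RB (adverb)
    return "n"      # assumption: treat everything else as a noun


tagged = [("surahs", "NNS"), ("invented", "VBN"), ("truthful", "JJ")]
pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_pairs)
```

Each pair can then be passed on as `WordNetLemmatizer().lemmatize(word, pos)`.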
+0

Wow, thank you. This is cool. But what is the "earthy" module, and where can I get it? I can't import "earthy"; the module name is not defined. – sang

+0

'pip install -U earthy' – alvas

+0

Cool, thanks, I've installed it. Are there any books or tutorials for the earthy library? – sang
