wordnet lemmatization and POS tags in python

I want to use the WordNet lemmatizer in Python. I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform the above lemmatization accurately?

I did the POS tagging with nltk.pos_tag, but I am lost on how to map the Treebank POS tags to WordNet-compatible POS tags. Please help:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
tokens = nltk.word_tokenize(text)  # `text` is the input string
tagged = nltk.pos_tag(tokens)

The output tags I get are NN, JJ, VB, and RB. How do I change these to WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data?

Answers


First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

Since it was trained on the Treebank corpus, it also uses the Treebank tag set.
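For illustration, here is roughly what those tags look like on a made-up sentence (note that recent NLTK releases ship an averaged perceptron tagger instead of this pickled maxent model, but it produces the same Treebank tag set):

import nltk

# first run may require: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
print(nltk.pos_tag(nltk.word_tokenize("The cats were running quickly")))
# roughly: [('The', 'DT'), ('cats', 'NNS'), ('were', 'VBD'), ('running', 'VBG'), ('quickly', 'RB')]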

The following function maps a Treebank tag to a WordNet part of speech name:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'
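Putting the pieces together, a minimal end-to-end sketch using the get_wordnet_pos function defined above (the sentence is made up, and the fallback to wordnet.NOUN for unmapped tags is my addition; see the comments below for the empty-string case):

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize("The cats were running quickly"))
# fall back to NOUN when get_wordnet_pos returns an empty string
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag) or wordnet.NOUN)
          for word, tag in tagged]
print(lemmas)  # expected: ['The', 'cat', 'be', 'run', 'quickly']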

Also remember the satellite adjectives =) ADJ_SAT = 's' http://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html – alvas 2013-04-05 05:52:32


The POS tag for 'I' in the string "I like it." is 'PRP'. The function returns an empty string, which the lemmatizer does not accept, and it throws a KeyError. What can be done in that case? – Clock Slave 2017-03-08 06:49:47


Does anyone know how efficient this is when processing entire documents? – Ksofiac 2017-07-26 17:31:25


The approach from @Suzana_K works. However, I have some cases that result in a KeyError, as @Clock Slave mentioned.

Convert the Treebank tags to WordNet tags:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  # so the caller can use a simple if-statement

Now, we only pass a POS to the lemmatize function when we actually have a WordNet tag:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)  # `tokens` is your tokenized input
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:
        # pass no tag, so the lemmatizer falls back to its noun default
        lemma = lemmatizer.lemmatize(word)
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag)
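With this guard, the "I" / PRP case from the comments above no longer raises a KeyError. A quick check on an illustrative sentence:

for word, tag in nltk.pos_tag(nltk.word_tokenize("I like it")):
    wntag = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(word) if wntag is None else lemmatizer.lemmatize(word, pos=wntag)
    print(word, tag, lemma)
# e.g.: I PRP I / like VBP like / it PRP it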

Conversion steps: Document -> Sentences -> Tokens -> POS -> Lemmas

import nltk 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet 

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    Split the document into sentences and tokenize each sentence.
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        """
        out: [['What', 'can', 'I', 'say', 'about', 'this', 'place', '.'], ...]
        """
        # split the document into single sentences
        sentences = self.splitter.tokenize(text)
        # tokenize each sentence
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass

    def get_wordnet_pos(self, treebank_tag):
        """
        Map a Treebank tag to the WordNet POS tags (a, n, r, v) used for lemmatization.
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # the default POS in lemmatization is noun
            return wordnet.NOUN

    def pos_tag(self, tokens):
        # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...]
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatize using the POS tag, converting each token into a triple of
        # [original word, lemmatized word, [POS tag]]
        pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag])
                       for (word, pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer() 
splitter = Splitter() 
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger() 

#step 1 split document into sentence followed by tokenization 
tokens = splitter.split(text) 

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens) 
print(lemma_pos_token) 
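On the example text, this should print one list per sentence of (original word, lemma, [POS tag]) triples, e.g. ('restaurants', 'restaurant', ['NNS']) and ('is', 'be', ['VBZ']).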

You can do this in one line:

wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n' 

Then use wnpos(nltk_pos) to get the POS to pass to .lemmatize(). In your case: lmtzr.lemmatize(word=tagged[0][0], pos=wnpos(tagged[0][1])).
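A minimal sketch of the one-liner in context (the sentence is illustrative):

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

# map Treebank tags to WordNet POS: J -> a, keep n/r/v, default to n
wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize("She was singing happily"))
print([lmtzr.lemmatize(word=w, pos=wnpos(t)) for w, t in tagged])
# expected: ['She', 'be', 'sing', 'happily']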