2015-07-10

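Challenge the NLTK part-of-speech tagger to report a plural proper noun

Let's try out the standard part-of-speech tagger for Python, in the nltk package.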

import nltk 
# You might also need to run nltk.download('maxent_treebank_pos_tagger') 
# even after installing nltk 

string = 'Buddy Billy went to the moon and came Back with several Vikings.' 
nltk.pos_tag(nltk.word_tokenize(string)) 

This gives me

[('Buddy', 'NNP'), ('Billy', 'NNP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('moon', 'NN'), ('and', 'CC'), ('came', 'VBD'), ('Back', 'NNP'), ('with', 'IN'), ('several', 'JJ'), ('Vikings', 'NNS'), ('.', '.')]

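You can interpret the codes here. I'm a little disappointed that 'Back' gets classified as a proper noun (NNP), although the confusion is understandable. I'm more upset that 'Vikings' gets tagged as a simple plural noun (NNS) instead of a plural proper noun (NNPS). Can anyone come up with a simple input that yields at least one NNPS tag?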

Answer

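There seems to be a mismatch: plural proper nouns are tagged NPS rather than NNPS in the NLTK Brown corpus (possibly the NLTK tagset is an updated or outdated one that differs from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).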

Here is an example of a sentence with a plural proper noun tag:

>>> from nltk.corpus import brown 
>>> for sent in brown.tagged_sents(): 
...  if any(pos for word, pos in sent if pos == 'NPS'): 
...    print(sent)
...    break 
... 
[(u'Georgia', u'NP'), (u'Republicans', u'NPS'), (u'are', u'BER'), (u'getting', u'VBG'), (u'strong', u'JJ'), (u'encouragement', u'NN'), (u'to', u'TO'), (u'enter', u'VB'), (u'a', u'AT'), (u'candidate', u'NN'), (u'in', u'IN'), (u'the', u'AT'), (u'1962', u'CD'), (u"governor's", u'NN$'), (u'race', u'NN'), (u',', u','), (u'a', u'AT'), (u'top', u'JJS'), (u'official', u'NN'), (u'said', u'VBD'), (u'Wednesday', u'NR'), (u'.', u'.')] 

But if you tag the same sentence with nltk.pos_tag, you get NNPS:

>>> for sent in brown.tagged_sents(): 
...  if any(pos for word, pos in sent if pos == 'NPS'): 
...    print(" ".join([word for word, pos in sent]))
...    break 
... 
Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday . 
>>> from nltk import pos_tag 
>>> pos_tag("Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .".split()) 
[('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'), ('getting', 'VBG'), ('strong', 'JJ'), ('encouragement', 'NN'), ('to', 'TO'), ('enter', 'VB'), ('a', 'DT'), ('candidate', 'NN'), ('in', 'IN'), ('the', 'DT'), ('1962', 'CD'), ("governor's", 'NNS'), ('race', 'NN'), (',', ','), ('a', 'DT'), ('top', 'JJ'), ('official', 'NN'), ('said', 'VBD'), ('Wednesday', 'NNP'), ('.', '.')] 
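
To try other candidate sentences, it can help to filter the tagger output for the tag you care about. This is only a minimal sketch in plain Python that does not call NLTK: `words_with_tag` is a hypothetical helper name, and the sample list is just the first few pairs copied from the `pos_tag` output above.

```python
def words_with_tag(tagged, tag):
    """Return the words in a list of (word, tag) pairs that carry `tag`."""
    return [word for word, pos in tagged if pos == tag]


# First six pairs of the pos_tag output shown above (copied, not recomputed).
tagged = [('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'),
          ('getting', 'VBG'), ('strong', 'JJ'), ('encouragement', 'NN')]

print(words_with_tag(tagged, 'NNPS'))  # ['Republicans']
```

The same helper works on Brown-style output if you pass 'NPS' instead of 'NNPS'.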

I confirm that the 'Georgia Republicans ...' input successfully triggers the 'NNPS' tag. – zkurtz


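Regarding your point that "There seems to be some issue with the tagging ...", I don't find it surprising that the tags returned by `brown` (http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html) differ from those produced by `nltk.pos_tag`, since the latter is based on the Penn Treebank corpus, which uses a completely different tagset (http://cs.nyu.edu/grishman/jet/guide/PennPOS.html). – zkurtz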
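
The difference between the two tagsets is easy to see by lining up the two outputs from the answer token by token. The following is a rough sketch only: the pairs are a hand-copied subset of the two outputs above, and `tag_differences` is a hypothetical helper, not part of NLTK.

```python
# A subset of tokens copied from the two outputs in the answer:
# Brown corpus tags first, nltk.pos_tag (Penn Treebank) tags second.
brown_tagged = [('Georgia', 'NP'), ('Republicans', 'NPS'), ('are', 'BER'),
                ('getting', 'VBG'), ('a', 'AT'), ('Wednesday', 'NR')]
penn_tagged = [('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'),
               ('getting', 'VBG'), ('a', 'DT'), ('Wednesday', 'NNP')]


def tag_differences(first, second):
    """Return (word, tag_a, tag_b) for aligned tokens whose tags differ."""
    return [(w1, t1, t2)
            for (w1, t1), (w2, t2) in zip(first, second)
            if w1 == w2 and t1 != t2]


for word, brown_tag, penn_tag in tag_differences(brown_tagged, penn_tagged):
    print(word, brown_tag, penn_tag)
```

Of these tokens only 'getting' keeps the same tag (VBG) in both outputs, which fits the explanation that the two use separate tagsets rather than one being a broken copy of the other.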