The default NLTK pos_tag has somehow learned that please is a noun. That is incorrect in almost any case in proper English, e.g.
>>> from nltk import pos_tag
>>> pos_tag('Please go away !'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please'.split())
[('Please', 'VB')]
>>> pos_tag('please'.split())
[('please', 'NN')]
>>> pos_tag('please !'.split())
[('please', 'NN'), ('!', '.')]
>>> pos_tag('Please !'.split())
[('Please', 'NN'), ('!', '.')]
>>> pos_tag('Would you please go away ?'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
>>> pos_tag('Would you please go away !'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please go away ?'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
Using WordNet as a benchmark, there should be no case where please is a noun.
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('please')
[Synset('please.v.01'), Synset('please.v.02'), Synset('please.v.03'), Synset('please.r.01')]
But I think this is largely due to the text that was used to train the PerceptronTagger rather than the implementation of the tagger itself.
Now, let's take a look at what's inside the pre-trained PerceptronTagger. We see that it only knows 1500+ words.
One trick you can do is to hack the tagger:
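>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()  # loads the pre-trained English model
>>> tagger.tagdict['start'] = 'VB'
>>> tagger.tagdict['please'] = 'VB'
>>> tagger.tag('please start with me'.split())
[('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]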
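For what it's worth, the override above seems to stick because, as far as I can tell from the perceptron source, the tagger checks tagdict first and only asks the statistical model on a miss, and tagdict only stores words that were frequent and nearly unambiguous in the training data. Here is a toy, stdlib-only sketch of that idea. It is not NLTK's actual code: build_tagdict, tag, guess_noun and the corpus are made up for illustration, and the thresholds 20 and 0.97 are what I believe NLTK uses, so treat them as an assumption.

```python
from collections import Counter, defaultdict


def build_tagdict(tagged_sentences, freq_thresh=20, ambiguity_thresh=0.97):
    """Map words that are frequent and almost always carry one tag to that tag.

    The default thresholds are assumed values, not read from NLTK at runtime.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1

    tagdict = {}
    for word, tag_freqs in counts.items():
        best_tag, best_count = tag_freqs.most_common(1)[0]
        total = sum(tag_freqs.values())
        if total >= freq_thresh and best_count / total >= ambiguity_thresh:
            tagdict[word] = best_tag
    return tagdict


def tag(tokens, tagdict, fallback):
    """Look each token up in tagdict first; call fallback only on a miss."""
    return [(tok, tagdict.get(tok) or fallback(tok)) for tok in tokens]


def guess_noun(token):
    """Hypothetical stand-in for the statistical model: always guesses NN."""
    return "NN"


# Toy corpus: 'me' is frequent and unambiguous, 'please' is too rare to be stored.
corpus = [[("me", "PRP")]] * 25 + [[("please", "VB")]] * 3
tagdict = build_tagdict(corpus)

print(tag("please help me".split(), tagdict, guess_noun))
# [('please', 'NN'), ('help', 'NN'), ('me', 'PRP')]

tagdict["please"] = "VB"  # same kind of override as in the session above
print(tag("please help me".split(), tagdict, guess_noun))
# [('please', 'VB'), ('help', 'NN'), ('me', 'PRP')]
```

The point is just that a word missing from the dictionary is left to the model's guess, while a dictionary entry wins unconditionally, which is why patching tagdict is a quick fix but retraining is the cleaner one.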
But the most reasonable thing to do is simply to retrain the tagger, see http://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.train
If you don't want to retrain the tagger, then see Python NLTK pos_tag not returning the correct part-of-speech tag
Most probably, using the StanfordPOSTagger gets you what you need:
>>> from nltk import StanfordPOSTagger
>>> sjar = '/home/alvas/stanford-postagger/stanford-postagger.jar'
>>> m = '/home/alvas/stanford-postagger/models/english-left3words-distsim.tagger'
>>> spos_tag = StanfordPOSTagger(m, sjar)
>>> spos_tag.tag('Please go away !'.split())
[(u'Please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Please'.split())
[(u'Please', u'VB')]
>>> spos_tag.tag('Please !'.split())
[(u'Please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please !'.split())
[(u'please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please'.split())
[(u'please', u'VB')]
>>> spos_tag.tag('Would you please go away !'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Would you please go away ?'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'?', u'.')]
For Linux: see https://gist.github.com/alvations/e1df0ba227e542955a8a
For Windows: see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9
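This issue could be caused by the use of improper English in the sentence. NLTK works well with proper English, but grammatically incorrect sentences will cause problems. 'Please start with me' is a sentence fragment. Also, I imagine your code has more errors, because I tried this sentence in the NLTK POS tagger here: http://textanalysisonline.com/nltk-pos-tagging and it worked fine: 'start | NN please | NN with | IN me | PRP' – Rob
Use a different POS tagger? Maybe the default English one isn't very good or robust. –
How do I use a different one? –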