nltk StanfordNERTagger: How to get proper nouns without capitalization

I am trying to use StanfordNERTagger and nltk to extract keywords from a piece of text.

import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]

str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
# keep only the words the POS tagger marks as proper nouns
stanfordPosTagList = [word for word, pos in stp.tag(str.split()) if pos == 'NNP']

print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged

This gives me

John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics 
Stanford POS Tagged 
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term'] 
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

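So clearly, words like Short and Term are tagged as NNP. My data contains many cases like this, where non-NNP words are capitalized. This may be due to typos, or the words may come from headers. I don't have much control over that.

How can I parse or clean the data so that I can detect non-NNP terms even when they are capitalized? I don't want terms like Short and Term to be classified as NNP.

Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could that be due to the other capitalized non-NNP words in my data? Could they affect how StanfordNERTagger treats everything else?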

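Update: one possible solution

Here is what I plan to do:

Take each word and convert it to lowercase
Tag the lowercased word
If the tag is NNP, then we know the original word must also be an NNP
If not, then the original word was mis-capitalized

Here is what I tried: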

str = " ".join(words) 
print str 
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') 
for word in str.split(): 
    wl = word.lower() 
    print wl 
    w,pos = stp.tag(wl) 
    print pos 
    if pos=="NNP": 
     print "Got NNP" 
     print w

But this gives me an error:

John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics 
john 
Traceback (most recent call last): 
    File "X:\crp.py", line 37, in <module> 
    w,pos = stp.tag(wl) 
ValueError: too many values to unpack

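I have tried several approaches, but some error always shows up. How can I tag a single word?

I don't want to convert the whole string to lowercase and then tag it. If I do that, StanfordPOSTagger returns an empty string.

asked 2015-12-23 by AbtPst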

First, see your other question on setting up Stanford CoreNLP so that it can be called from the command line or from Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we can see that NER works properly:

>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

And for the lowercased sentence, you get neither NNP as a POS tag nor any NER tag:

>>> for token in annotated_sent1['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
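As a rough illustration of how the per-token output shown in this answer could be consumed, here is a minimal sketch that picks out the proper nouns and joins adjacent tokens sharing an NER label. The helper names `proper_nouns` and `entity_spans` are hypothetical, and `sample` is a hand-copied subset of the token fields printed in this answer rather than a live CoreNLP response. It is written for Python 3, unlike the Python 2 session in this thread.

```python
from itertools import groupby

PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # Penn Treebank proper-noun tags


def proper_nouns(tokens):
    """Return the words whose POS tag marks them as proper nouns."""
    return [t["word"] for t in tokens if t["pos"] in PROPER_NOUN_TAGS]


def entity_spans(tokens):
    """Join runs of adjacent tokens that share the same non-'O' NER label."""
    spans = []
    for label, run in groupby(tokens, key=lambda t: t["ner"]):
        if label != "O":
            spans.append((" ".join(t["word"] for t in run), label))
    return spans


# Hand-copied sample with the same keys as the token dictionaries
# in this answer (assumption: only 'word', 'pos' and 'ner' are needed).
sample = [
    {"word": "John", "pos": "NNP", "ner": "PERSON"},
    {"word": "Donk", "pos": "NNP", "ner": "PERSON"},
    {"word": "works", "pos": "VBZ", "ner": "O"},
    {"word": "POI", "pos": "NNP", "ner": "ORGANIZATION"},
    {"word": "Jones", "pos": "NNP", "ner": "ORGANIZATION"},
    {"word": "wants", "pos": "VBZ", "ner": "O"},
]

print(proper_nouns(sample))
print(entity_spans(sample))
```

Note that grouping by label alone merges POI and Jones into one ORGANIZATION span, which mirrors the tagger's own output for this stop-word-stripped text.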

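So the questions you should be asking are:

- What is the ultimate aim of your NLP application?
- Why is your input lowercased? Was it your doing, or is that how the data was provided?

After answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.

If the input is lowercased because of how you structured your NLP tool chain, then
- Do not do that! Perform NER on the normal text, without the distortions you introduced. The NER model was trained on normal text, so it won't really work outside the context of normal text.
- Also try not to mix NLP tools from different suites; they usually don't play nicely together, especially at the end of your NLP tool chain.

If the input is lowercased because that is how the original data was, then:
- Annotate a small portion of the data, or find annotated data that is lowercased, and then retrain a model.
- Work around it: train a truecaser on normal text and then apply the truecasing model to the lowercased text. See https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf

If the input has erroneous casing, e.g. `Some big Some Small but not all are Proper Noun`, then
- Try the truecasing solution too.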
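To make the truecasing idea a little more concrete, here is a deliberately simple sketch: it learns the most frequent surface casing of each word from normally cased text and applies it to new tokens. This is a unigram frequency baseline, not the method from the linked paper, and the function names and the toy training text are made up for illustration.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_tokens):
    """Map each lowercased word to its most frequent casing in the training tokens."""
    counts = defaultdict(Counter)
    for token in cased_tokens:
        counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(tokens, model):
    """Restore the learned casing; tokens unseen in training are left unchanged."""
    return [model.get(token.lower(), token) for token in tokens]


# Toy training text (an assumption, standing in for a large normally cased corpus).
training = (
    "John Donk works for POI . the short term metrics look fine . "
    "Brian Jones met John Donk to review short term performance ."
).split()
model = train_truecaser(training)

print(truecase("john donk wants Short Term metrics".split(), model))
```

A real truecaser would need far more training text and some handling of sentence-initial capitals, which this sketch counts like any other occurrence.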

answered 2015-12-24 21:44:27 by alvas

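Thanks a lot for your help :) As a follow-up, which POS tags usually mark proper nouns in English? – AbtPst

From the Penn Treebank tagset: 'NNP' and 'NNPS' (see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) – alvas

Right, but in a given text, which POS tags are likely to appear around a proper noun? Is there any such likelihood? – AbtPst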

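First, you should not use built-in names as variable names in your program. Avoid using str as a variable name, since that shadows the built-in str type. Use newstring or anything else instead.

In the update, you pass each lowercased word to the POS tagger as a plain string. The tag() method expects a list of tokens, so it treats the string as a sequence, splits it into characters, and returns a POS tag for each character.

So I suggest you pass a list to tag() rather than a single word. The list can contain just one word at a time.

You can try something like: w = stp.tag([wl]). w will then be a one-element list holding the (word, POS) pair.

This way you can tag a list of words.

But in this case it gives the POS tag for john as NN.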
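The difference between passing a string and passing a one-element list can be sketched without a Stanford installation. `StubTagger` below is a hypothetical stand-in whose tag() iterates over whatever it receives; its tagging rule (capitalized means NNP) is invented purely for the demonstration, and `tag_word` is an illustrative helper name. With NLTK you would pass your StanfordPOSTagger instance to `tag_word` instead. The sketch uses Python 3 syntax.

```python
class StubTagger:
    """Hypothetical stand-in for a tagger whose tag() takes a list of tokens."""

    def tag(self, tokens):
        # Iterating over a str yields its characters, which is why a bare
        # string comes back as one (character, tag) pair per character.
        return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]


def tag_word(tagger, word):
    """Tag a single word by wrapping it in a one-element list."""
    [(token, pos)] = tagger.tag([word])  # exactly one pair is expected
    return token, pos


stp = StubTagger()

# A bare string yields several pairs, so "w, pos = stp.tag(wl)" cannot unpack them.
print(len(stp.tag("john")))

# A one-element list yields one pair, which unpacks cleanly.
print(tag_word(stp, "john"))
```

Unpacking inside the helper keeps the calling loop free of the `[0]` and `[1]` indexing.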

answered 2015-12-24 07:32:49

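Thank you! But how do I extract the NN? For each word I want to look at the POS and do some processing. When I try to print stp.tag([wl.lower()])[1] it says index out of range. Index [0] prints both elements as (u'john', u'NN') – AbtPst

Never mind, I got it: x = stp.tag([w.lower()]) y = x[0] print y[1] :) – AbtPst

Just do 'w[1]' and you will get the word's POS. Don't try to do everything in one line. –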
