How to add new entity (ORG) instances in spaCy

I am trying to add stock ticker symbols to the strings that are recognized as ORG entities. For each symbol, I do:

nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])

I can see that the symbol gets added to the patterns:

print "Patterns:", nlp.matcher._patterns

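But any symbol that was not recognized before the add is still not recognized after it. Apparently these tokens already exist in the vocabulary (which is why the vocab length does not change).

What should I be doing differently? What am I missing?

Thanks

Here is my sample code: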

"""Short snippet to practice adding stock ticker symbols as ORG entities"""

from spacy.en import English 
import spacy.en 
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63 
import os 
import csv 
import sys 

nlp = English() #Load everything for the English model 

print "Before nlp vocab length", len(nlp.matcher.vocab) 

symbol_list = [u"CHK", u"JONE", u"NE", u"DO", u"ESV"] 

txt = u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""# u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)""" 
before = nlp(txt) 
for tok in before: #Before adding entities 
    print tok, tok.orth, tok.tag_, tok.ent_type_ 

for symbol in symbol_list: 
    print "adding symbol:", symbol 
    print "vocab length:", len(nlp.matcher.vocab) 
    print "pattern length:", nlp.matcher.n_patterns 
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]]) 


print "Patterns:", nlp.matcher._patterns 
print "Entities:", nlp.matcher._entities 
for ent in nlp.matcher._entities: 
    print ent.label 

tokens = nlp(txt) 

print "\n\nAfter:" 
print "After nlp vocab length", len(nlp.matcher.vocab) 

for tok in tokens: 
    print tok, tok.orth, tok.tag_, tok.ent_type_


2016-10-31 user1430965

If you are on spaCy > 1.0, you should give each matcher pattern a callback function and merge the tokens manually.


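Thanks for the suggestion, but could you provide more detail? Where do I add the callback? What should the callback do? How do I merge tokens manually? Sorry, I am just getting started with spaCy. Thanks, Herb – user1430965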

Here is a working example, based on the docs:

import spacy 

nlp = spacy.load('en') 

def merge_phrases(matcher, doc, i, matches): 
    ''' 
    Merge a phrase. We have to be careful here because we'll change the token indices. 
    To avoid problems, merge all the phrases once we're called on the last match. 
    ''' 
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab) 
matcher.add(entity_key='stock-nyse', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'NYSE'}]], on_match=merge_phrases) 
matcher.add(entity_key='stock-esv', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'ESV'}]], on_match=merge_phrases) 
doc = nlp(u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)""") 
matcher(doc) 
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])

->

['drive|', 'double|', '-|', 'digit|', 'rallies|', 'in|', 'Chesapeake|ORG', 'Energy|ORG', '(|', 'NYSE|STOCK', ':|', 'CHK|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'NE|GPE', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'DO|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'ESV|STOCK', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'JONE|ORG', ')|']

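NYSE and ESV are now tagged with the STOCK entity type. Basically, on each match you should manually merge the tokens and/or assign the entity type you want. There is also an acceptor function, which lets you filter or reject matches as they are found.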
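To make the "wait for the last match, then merge" idea in the callback concrete, here is a minimal toy sketch in plain Python. It does not use spaCy at all: the list-of-pairs "doc", `find_matches`, `run_matcher`, and `label_on_last_match` are hypothetical names invented for this illustration. It also uses a different trick from the answer's code: instead of collecting span objects first, it merges from right to left so that earlier offsets stay valid.

```python
# Toy model of the on_match callback protocol; this is NOT spaCy.
# A "doc" is a list of [text, ent_type] pairs; a match is (label, start, end).

def find_matches(doc, patterns):
    """Return every (label, start, end) where a word tuple in `patterns` occurs."""
    texts = [text for text, _ in doc]
    matches = []
    for words, label in patterns.items():
        n = len(words)
        for start in range(len(texts) - n + 1):
            if tuple(texts[start:start + n]) == words:
                matches.append((label, start, start + n))
    return sorted(matches, key=lambda m: m[1])

def label_on_last_match(doc, i, matches):
    """Callback: ignore every call except the final one, then merge all spans."""
    if i != len(matches) - 1:
        return
    # Merge right to left: collapsing a later span never shifts
    # the start/end offsets of spans that come before it.
    for label, start, end in sorted(matches, key=lambda m: m[1], reverse=True):
        merged_text = " ".join(text for text, _ in doc[start:end])
        doc[start:end] = [[merged_text, label]]

def run_matcher(doc, patterns, on_match):
    """Find all matches, then invoke the callback once per match."""
    matches = find_matches(doc, patterns)
    for i in range(len(matches)):
        on_match(doc, i, matches)
    return matches

doc = [[w, ""] for w in "Chesapeake Energy ( NYSE : CHK )".split()]
patterns = {("Chesapeake", "Energy"): "ORG", ("CHK",): "STOCK"}
run_matcher(doc, patterns, label_on_last_match)
print(["%s|%s" % (text, ent) for text, ent in doc])
```

If the callback merged "Chesapeake Energy" on the first call instead, the stored offsets for "CHK" would point one token too far to the right, which is the problem the docstring in the answer warns about.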


2016-11-11 10:02:44

Thanks. I am traveling, but I will look at this as soon as I get back. I appreciate your help. – user1430965
