I have installed spaCy and the en_core_web_sm model. When I run my code, which is supposed to extract information about people from random news articles, only about 50% of the extracted data is correct; the rest contains mistakes and noise. How can I improve the quality of spaCy's results?
import io
import spacy
from spacy.en import English  # spaCy 1.x-style import

nlp = English()

# Load the article text and run the pipeline
# (this step was missing from my original snippet, so `doc` was undefined)
with io.open('article.txt', encoding='utf8') as f:
    doc = nlp(f.read())

for entity in doc.ents:
    if entity.label_ == 'PERSON':
        print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
On this article, for example: http://www.abc.net.au/news/2015-10-30/is-nauru-virtually-a-failed-state/6869648 I get these results:
(377, u'PERSON', u'Lukas Coch)\\nMap')
(377, u'PERSON', u'\\"never')
(377, u'PERSON', u'Julie Bishop')
(377, u'PERSON', u'Tanya Plibersek')
(377, u'PERSON', u'Mr Eames')
(377, u'PERSON', u'DFAT')
(377, u'PERSON', u'2015Andrew Wilkie')
(377, u'PERSON', u'Daniel Th\xfcrer')
(377, u'PERSON', u'Australian Aid')
(377, u'PERSON', u'Nauru')
(377, u'PERSON', u'Rule')
How can I increase the quality of the results?
Would the full en_core_web_md model help?
Or are NLP library methods like this always worse than deep-learning packages such as TensorFlow?
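For what it's worth, several of the bad spans above look like text-extraction artifacts rather than model errors: words glued together (`2015Andrew Wilkie`), embedded newlines (`Lukas Coch)\nMap`), and stray quote characters (`\"never`). One thing I have been experimenting with is normalizing the scraped text before running NER, and post-filtering the PERSON spans. This is just a rough heuristic sketch (the function names and regexes are my own, not part of spaCy):

```python
import re

def clean_text(text):
    """Normalize scraped article text before handing it to spaCy."""
    # Hypothetical heuristic: insert a space where a lowercase letter,
    # digit, or ')' is glued directly to an uppercase letter,
    # e.g. "2015Andrew" -> "2015 Andrew".
    # (Caveat: this also splits names like "McDonald".)
    text = re.sub(r'(?<=[a-z0-9)])(?=[A-Z])', ' ', text)
    # Collapse newlines and runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

def looks_like_person(span_text):
    """Post-filter: reject spans containing digits, quotes,
    backslashes, or newlines, which rarely occur in real names."""
    return re.search(r'[\d"\\\n]', span_text) is None
```

With a filter like this, spans such as `\"never` and `2015Andrew Wilkie` get dropped, while `Julie Bishop` passes through. It obviously cannot fix genuine misclassifications like `Nauru` or `DFAT`, which is why I am also asking whether a larger model would help.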