2016-07-10 29 views

I want to use the polyglot package for named entity recognition in Hebrew.
This is my code:

# -*- coding: utf8 -*- 
import polyglot 
from polyglot.text import Text, Word 
from polyglot.downloader import downloader 
downloader.download("embeddings2.iw") 
text = Text(u"in france and in germany") 
print(type(text)) 
text2 = Text(u"נסעתי מירושלים לתל אביב") 
print(type(text2)) 
print(text.entities) 
print(text2.entities) 

This is the output:

<class 'polyglot.text.Text'> 
<class 'polyglot.text.Text'> 
[I-LOC([u'france']), I-LOC([u'germany'])] 
Traceback (most recent call last):
  File "C:/Python27/Lib/site-packages/IPython/core/pyglot.py", line 15, in <module>
    print(text2.entities)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 132, in entities
    for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 100, in ne_chunker
    return get_ner_tagger(lang=self.language.code)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 191, in get_ner_tagger
    return NEChunker(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in __init__
    super(NEChunker, self).__init__(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
    self.predictor = self._load_network()
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 109, in _load_network
    self.embeddings = load_embeddings(self.lang, type='cw', normalize=True)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 61, in load_embeddings
    p = locate_resource(src_dir, lang)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 43, in locate_resource
    if downloader.status(package_id) != downloader.INSTALLED:
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 738, in status
    info = self._info_or_id(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 508, in _info_or_id
    return self.info(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in info
    raise ValueError('Package %r not found in index' % id)
ValueError: Package u'embeddings2.iw' not found in index

It works for English, but not for Hebrew.
Whether or not I try to download the package u'embeddings2.iw', I get:

ValueError: Package u'embeddings2.iw' not found in index 

Answer


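I figured it out!
This looks like a bug to me.
The language detection identifies the language as 'iw', which is the former ISO 639 code for Hebrew; the code was later changed to 'he'. text.entities does not recognize the 'iw' code, so I overrode it like this: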

text2.hint_language_code = 'he'
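
If the same problem might show up for other languages, the override can be generalized a little. The following is only a minimal sketch, not part of polyglot: the table and the helper name normalize_language_code are hypothetical, and the sketch assumes the detector may also return the other retired two-letter ISO 639 codes ('in' for Indonesian, 'ji' for Yiddish) alongside 'iw'.

```python
# Hypothetical helper, not part of polyglot: maps retired two-letter
# ISO 639 codes to the codes that replaced them.
DEPRECATED_ISO639_CODES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_language_code(code):
    """Return the current code for a possibly retired ISO 639 code."""
    code = code.lower()
    # Codes that were never retired are returned unchanged.
    return DEPRECATED_ISO639_CODES.get(code, code)


print(normalize_language_code("iw"))  # he
print(normalize_language_code("en"))  # en
```

The returned value would then be assigned to text2.hint_language_code before text2.entities is read, in the same way as the hard-coded 'he' above.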