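Genia tagger "file not found" error in Anaconda/NLTK

I need to use NLTK for text preprocessing tasks such as sentence splitting, tokenization, and tagging. I want to use the GENIA tagger for the tagging step. I am using Anaconda version 3.10 and installed geniatagger with the following command.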

python setup.py install

In the IPython console, I entered the following code.

import geniatagger 
tagger =geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger') 
print tagger.parse('Welcome to natural language processing!')

When I press Enter, I get the following error message.

--------------------------------------------------------------------------- 
WindowsError        Traceback (most recent call last) 
<ipython-input-2-52e4d65c2d02> in <module>() 
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger') 
    2 print tagger.parse('Welcome to natural language processing!') 
    3 

C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger) 
19   self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger), 
20           cwd=self._dir_to_tagger, 
---> 21           stdin=subprocess.PIPE, stdout=subprocess.PIPE) 
22 
23  def parse(self, text): 

C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags) 
708         p2cread, p2cwrite, 
709         c2pread, c2pwrite, 
--> 710         errread, errwrite) 
711   except Exception: 
712    # Preserve original exception in case os.close raises. 

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite) 
956           env, 
957           cwd, 
--> 958           startupinfo) 
959    except pywintypes.error, e: 
960     # Translate pywintypes.error to WindowsError, which is 

WindowsError: [Error 2] The system cannot find the file specified

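Why am I getting this error message, and how can I fix it?

If I get this tagger working, will it also perform the tokenization step?

Note: the geniatagger Python file is located inside the 'geniatagger' folder.

Source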

2015-08-18 Dakshila Kamalsooriya

I tried this in cmd, and the output was 3.0.3.

TL;DR:

# Install Genia Tagger (C code). 
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd .. 
# Install Genia Tagger (python wrapper) 
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd .. 
$ python 
>>> from geniatagger import GeniaTagger 
>>> tagger = GeniaTagger('./geniatagger/geniatagger') 
>>> loading morphdic...done. 
loading pos_models................done. 
loading chunk_models....done. 
loading named_entity_models..done. 

>>> print tagger.parse('This is a pen.') 
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]

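I'm not sure whether the Genia Tagger package works out of the box from conda, so I think a plain Python/pip fix is simpler. First, NLTK does not support Genia Tagger (at least not yet), so this is not a problem with the NLTK installation or its modules.

The problem may lie in some outdated imports used by the original GeniaTagger C code (http://www.nactem.ac.uk/tsujii/GENIA/tagger/).

To resolve it, you have to add #include <cstdlib> to the original code. Thankfully, @saffsd has already done so and put it in his GitHub repository (https://github.com/saffsd/geniatagger/blob/master/morph.cpp).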

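Next, install the Python wrapper. You can either:

Install from the official PyPI with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz
Or install from some other GitHub repository, e.g. https://github.com/informationsea/geniatagger-python, which comes up first in a Google search

Lastly, the GeniaTagger initialization in Python is rather odd: it does not take the path to the tagger's directory but the path to the tagger executable itself, and it assumes the model files are in the same directory as the tagger; see https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19.

It also possibly expects './' at the first level of the directory path, so you have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger').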
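To make the path handling concrete, here is a minimal sketch of how the wrapper appears to split the path it receives, based on the Popen call visible in the traceback in the question. This is not the wrapper's actual code: the helper names split_tagger_path and check_tagger_path are hypothetical, and deriving the working directory with os.path.dirname is an assumption. The second helper is a simple pre-check you could run before constructing GeniaTagger, so a wrong path shows up as a readable message instead of "The system cannot find the file specified".

```python
import os


def split_tagger_path(path_to_tagger):
    """Split a tagger path into (working directory, command to run).

    Assumption: this mirrors the wrapper, which runs
    './' + basename(path) with cwd set to the containing directory.
    """
    directory = os.path.dirname(path_to_tagger)
    command = './' + os.path.basename(path_to_tagger)
    return directory, command


def check_tagger_path(path_to_tagger):
    """Return a list of problems; an empty list means the path looks usable."""
    problems = []
    if not os.path.isfile(path_to_tagger):
        # Typical causes: the C code was never compiled with make,
        # or the path points at the folder rather than the binary.
        problems.append('not a file: %s' % path_to_tagger)
    elif not os.access(path_to_tagger, os.X_OK):
        problems.append('not executable: %s' % path_to_tagger)
    return problems


if __name__ == '__main__':
    path = './geniatagger/geniatagger'
    print(split_tagger_path(path))
    for problem in check_tagger_path(path):
        print(problem)
```

If check_tagger_path reports that the path is not a file, the compiled tagger binary is probably missing from that folder, which would be consistent with the error in the question.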

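Beyond the installation issues: if you use the Python wrapper for GeniaTagger, the GeniaTagger object has only one function, parse(). Its input is a single sentence string, and its output is a list of tuples for that sentence, one tuple per token. The items in each tuple are:

Token (surface word)
Lemma (see Stemmers vs Lemmatizers)
POS tag (looks like the Penn Treebank tagset, see What are all possible pos tags of NLTK?)
Chunk tag, e.g. B-NP or B-VP (see Output results in conll format (POS-tagging, stanford pos tagger))
Named entity chunk

Source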
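As an illustration of working with that output, the sketch below wraps each tuple in a namedtuple so the five fields can be accessed by name. The field names (token, lemma, pos, chunk, ner) are hypothetical labels chosen here, not part of the wrapper. The parsed list is copied from the sample session above, so the example runs without the tagger installed.

```python
from collections import namedtuple

# Hypothetical field names for the five items in each tuple.
TaggedToken = namedtuple('TaggedToken', ['token', 'lemma', 'pos', 'chunk', 'ner'])

# Sample output of tagger.parse('This is a pen.') from the session above.
parsed = [
    ('This', 'This', 'DT', 'B-NP', 'O'),
    ('is', 'be', 'VBZ', 'B-VP', 'O'),
    ('a', 'a', 'DT', 'B-NP', 'O'),
    ('pen', 'pen', 'NN', 'I-NP', 'O'),
    ('.', '.', '.', 'O', 'O'),
]

tokens = [TaggedToken(*row) for row in parsed]

# Just the surface words: the tokenized sentence.
words = [t.token for t in tokens]

# (word, POS tag) pairs, similar in shape to what NLTK taggers return.
pos_pairs = [(t.token, t.pos) for t in tokens]

print(words)
print(pos_pairs)
```

Because the sample output has one tuple per token, with 'pen' and '.' already separated, parse() seems to tokenize the sentence itself. Sentence splitting would still have to happen before calling it, since the input is one sentence string.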

2015-08-18 17:39:26 alvas
