2017-10-13

NLTK Stanford Segmenter, how to set CLASSPATH

I am trying to use the Stanford Segmenter from the NLTK Tokenize package, but I am running into problems with a basic test case. Running the following:

# -*- coding: utf-8 -*- 
from nltk.tokenize.stanford_segmenter import StanfordSegmenter 
seg = StanfordSegmenter() 
seg.default_config('zh') 
sent = u'這是斯坦福中文分詞器測試' 
print(seg.segment(sent)) 

causes this error: Error

I have gotten as far as adding...

import os 
javapath = "C:/Users/User/Folder/stanford-segmenter-2017-06-09/*" 
os.environ['CLASSPATH'] = javapath 

...to the front of my code, but this does not seem to help.

How can I get the segmenter running properly?


The `CLASSPATH` should be a directory (or several), not a file glob. Change it to `"C:/Users/User/Folder/stanford-segmenter-2017-06-09"` and see if that helps. But there may be other problems, I can't tell. – alexis


That did not seem to help, but thank you. I may be trying to do too much without understanding how things are set up. For now, I will look into using a different program or package. Apparently `jieba` is another Python option that works without calling out to Java. – Savi


Suit yourself. But have you seen [Installing Third-Party Software](https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software) on the nltk GitHub wiki? (I don't know why that page doesn't appear in nltk's FAQ.) – alexis

Answer


Note: This method will only work with:

  • NLTK v3.2.5 (v3.2.6 will have a much simpler interface)
  • Stanford CoreNLP (version >= 2016-10-31)

First, you must have Java 8 properly installed. If Stanford CoreNLP works from the command line, the NLTK v3.2.5 API for Stanford CoreNLP is as follows.

Note: You must start the CoreNLP server in a terminal before using the new CoreNLP API in NLTK.

English

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-preload tokenize,ssplit,pos,lemma,parse,depparse \ 
-status_port 9000 -port 9000 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger 
>>> stpos, stner = CoreNLPPOSTagger(), CoreNLPNERTagger() 
>>> stpos.tag('What is the airspeed of an unladen swallow ?'.split()) 
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')] 
>>> stner.tag('Rami Eid is studying at Stony Brook University in NY'.split()) 
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')] 
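The tagger returns a flat list of `(token, tag)` pairs, so multi-word entities can be recovered with plain Python. As a sketch, here is a small hypothetical helper (not part of NLTK) that groups consecutive tokens sharing the same non-`O` NER label:

```python
def group_entities(tagged):
    """Group consecutive tokens with the same non-'O' NER label into entities."""
    entities, current, label = [], [], None
    for token, tag in tagged:
        if tag != 'O' and tag == label:
            # Same entity continues; extend the current span.
            current.append(token)
        else:
            # Entity boundary: flush any span collected so far.
            if current:
                entities.append((' '.join(current), label))
            current, label = ([token], tag) if tag != 'O' else ([], None)
    if current:
        entities.append((' '.join(current), label))
    return entities

tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```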

Chinese

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-chinese.properties \ 
-preload tokenize,ssplit,pos,lemma,ner,parse \ 
-status_port 9001 -port 9001 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger 
>>> from nltk.tokenize.stanford import CoreNLPTokenizer 
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9001'), CoreNLPNERTagger('http://localhost:9001') 
>>> sttok = CoreNLPTokenizer('http://localhost:9001') 

>>> sttok.tokenize(u'我家沒有電腦。') 
['我家', '沒有', '電腦', '。'] 

# Without segmentation (input to `raw_string_parse()` is a list of single-char strings) 
>>> stpos.tag(u'我家沒有電腦。') 
[('我', 'PN'), ('家', 'NN'), ('沒', 'AD'), ('有', 'VV'), ('電', 'NN'), ('腦', 'NN'), ('。', 'PU')] 
# With segmentation 
>>> stpos.tag(sttok.tokenize(u'我家沒有電腦。')) 
[('我家', 'NN'), ('沒有', 'VE'), ('電腦', 'NN'), ('。', 'PU')] 

# Without segmentation (input to `raw_string_parse()` is a list of single-char strings) 
>>> stner.tag(u'奧巴馬與邁克爾·傑克遜一起去雜貨店購物。') 
[('奧', 'GPE'), ('巴', 'GPE'), ('馬', 'GPE'), ('與', 'O'), ('邁', 'O'), ('克', 'PERSON'), ('爾', 'PERSON'), ('·', 'O'), ('傑', 'O'), ('克', 'O'), ('遜', 'O'), ('一', 'NUMBER'), ('起', 'O'), ('去', 'O'), ('雜', 'O'), ('貨', 'O'), ('店', 'O'), ('購', 'O'), ('物', 'O'), ('。', 'O')] 
# With segmentation 
>>> stner.tag(sttok.tokenize(u'奧巴馬與邁克爾·傑克遜一起去雜貨店購物。')) 
[('奧巴馬', 'PERSON'), ('與', 'O'), ('邁克爾·傑克遜', 'PERSON'), ('一起', 'O'), ('去', 'O'), ('雜貨店', 'O'), ('購物', 'O'), ('。', 'O')] 
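The per-character tags in the "without segmentation" runs above are expected: a plain Python string is iterable character by character, so the tagger receives one single-character "token" per Chinese character unless you segment first. A minimal illustration:

```python
sent = u'我家沒有電腦。'
# Iterating over a raw string yields individual characters,
# which is exactly what the tagger sees without segmentation:
print(list(sent))   # ['我', '家', '沒', '有', '電', '腦', '。']
# A segmenter instead produces whole words, e.g. ['我家', '沒有', '電腦', '。'],
# which is why tagging the tokenized input gives word-level tags.
```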

German

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 

wget http://nlp.stanford.edu/software/stanford-german-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-german.properties \ 
-preload tokenize,ssplit,pos,ner,parse \ 
-status_port 9002 -port 9002 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger 
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9002'), CoreNLPNERTagger('http://localhost:9002') 

>>> stpos.tag('Ich bin schwanger'.split()) 
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')] 

>>> stner.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split()) 
[('Donald', 'I-PER'), ('Trump', 'I-PER'), ('besuchte', 'O'), ('Angela', 'I-PER'), ('Merkel', 'I-PER'), ('in', 'O'), ('Berlin', 'I-LOC'), ('.', 'O')] 

Spanish

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 

wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-spanish.properties \ 
-preload tokenize,ssplit,pos,ner,parse \ 
-status_port 9003 -port 9003 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger 
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9003'), CoreNLPNERTagger('http://localhost:9003') 

>>> stner.tag(u'Barack Obama salió con Michael Jackson .'.split()) 
[(u'Barack', u'PERS'), (u'Obama', u'PERS'), (u'sali\xf3', u'O'), (u'con', u'O'), (u'Michael', u'PERS'), (u'Jackson', u'PERS'), (u'.', u'O')] 

>>> stpos.tag(u'Barack Obama salió con Michael Jackson .'.split()) 
[(u'Barack', u'np00000'), (u'Obama', u'np00000'), (u'sali\xf3', u'vmis000'), (u'con', u'sp000'), (u'Michael', u'np00000'), (u'Jackson', u'np00000'), (u'.', u'fp')] 
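The `u'sali\xf3'` in the output above is not corruption; it is just the Python 2 repr escaping the non-ASCII character `ó`. The underlying text is intact:

```python
# \xf3 is simply the escaped form of 'ó' in a unicode repr.
word = u'sali\xf3'
print(word)             # salió
assert word == u'salió'
```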

French

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 

wget http://nlp.stanford.edu/software/stanford-french-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-french.properties \ 
-preload tokenize,ssplit,pos,parse \ 
-status_port 9004 -port 9004 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger 
>>> stpos = CoreNLPPOSTagger('http://localhost:9004') 
>>> stpos.tag('Je suis enceinte'.split()) 
[(u'Je', u'CLS'), (u'suis', u'V'), (u'enceinte', u'NC')] 

Arabic

In the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 

wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-arabic.properties \ 
-preload tokenize,ssplit,pos,parse \ 
-status_port 9005 -port 9005 -timeout 15000 

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger 
>>> from nltk.tokenize.stanford import CoreNLPTokenizer 
>>> sttok = CoreNLPTokenizer('http://localhost:9005') 
>>> stpos = CoreNLPPOSTagger('http://localhost:9005') 
>>> text = u'انا حامل' 
>>> stpos.tag(sttok.tokenize(text)) 
[('انا', 'DET'), ('حامل', 'NC')]