
About the Stanford CoreNLP Chinese model

How do I use the Chinese model? I downloaded "stanford-corenlp-3.5.2-models-chinese.jar" onto my classpath, and I added

<dependency> 
    <groupId>edu.stanford.nlp</groupId> 
    <artifactId>stanford-corenlp</artifactId> 
    <version>3.5.2</version> 
    <classifier>models-chinese</classifier> 
</dependency> 

to my pom.xml file. In addition, my input.txt is

因出席中國大陸閱兵引發爭議的國民黨前主席連戰今晚金婚宴,立法院長王金平說,已向連戰恭喜,等一下回南部。 連戰夫婦今晚的50週年金婚紀念宴,正值連戰赴陸出席閱兵引發爭議之際,社會關注會否受到影響。 包括國民黨主席朱立倫、副主席郝龍斌等人已分別對外表示另有行程,無法出席。

Then I run the pipeline with the command

java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt 

and the result is as follows. The Chinese output comes out garbled; how do I solve this problem?

C:\stanford-corenlp-full-2015-04-20>java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt 
Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator 
Adding annotator segment 
Loading Segmentation Model ... Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 file: 
    edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz 
Done. Unique words in ChineseDictionary is: 423200. 
done [22.9 sec]. 

Ready to process: 1 files, skipped 0, total 1 
Processing file C:\stanford-corenlp-full-2015-04-20\input.txt ... writing to C:\stanford-corenlp-full-2015-04-20\input.txt.xml { 
    Annotating file C:\stanford-corenlp-full-2015-04-20\input.txt Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false 
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb 
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list 
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb 
?]?X?u????j???\?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?w?V?s?????A? 
[email protected]?U?^?n???C 
?s?????????50?g?~???B?????b?A????s??u???X?u?\?L??o???????A???|???`?|?_????v?T?C 
?]?A?????D?u?????B??D?u?q?s?y???H?w???O??~???t????{?A?L?k?X?u?C 

---> 
[?, ], ?, X, ?u????j???, \, ?L??o??????????e?D?u?s???????B?b?A??k?|???????????A? 
[email protected]?U?^?n???C, , , , ?s?????????, 50, ?, g?, ~, ???B?????b?A????s??u 
???X?u?, \, ?L??o???????A???, |, ???, `, ?, |, ?_????v?T?C, , , , ?, ], ?, A???? 
?D?u???, ??, B??D?u?q, ?, s?y???H?w???O??, ~, ???t????, {, ?, A?L?k?X?u?C] 

} 
Processed 1 documents 
Skipped 0 documents, error annotating 0 documents 
Annotation pipeline timing information: 
ChineseSegmenterAnnotator: 0.1 sec. 
TOTAL: 0.1 sec. for 34 tokens at 485.7 tokens/sec. 
Pipeline setup: 0.0 sec. 
Total time for StanfordCoreNLP pipeline: 0.1 sec. 

Answer


I edited your question to change the command to the one that you actually used to produce the output shown. It looks like you worked out that the former command:

java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt input.xml 

ran the English analysis pipeline, and that didn't work very well for Chinese text....

The CoreNLP support of Chinese in v3.5.2 is still a little rough, and will hopefully be a bit smoother in the next release. But from here you need to:

  • Specify a properties file for Chinese, giving appropriate models. (If no properties file is specified, CoreNLP defaults to English): -props StanfordCoreNLP-chinese.properties
  • At present, word segmentation of Chinese is done not by the tokenize annotator but by segment, specified as a custom annotator in StanfordCoreNLP-chinese.properties; a sketch of the relevant entries appears after this list. (Maybe we'll unify the two in a future release....)
  • The current dcoref annotator only works for English. There is Chinese coreference, but it is not fully integrated into the pipeline. If you want to use it, you currently have to write some code, as explained here. So let's delete it. (Again, this should be better integrated in the future.)
  • At that point, things run, but the ugly stderr output you show comes from the segmenter having VERBOSE turned on by default while your output character encoding is not right for the Chinese output. We should have VERBOSE off by default, but you can turn it off with: -segment.verbose false
  • We have no Chinese lemmatizer, so we may as well delete that annotator.
  • Also, CoreNLP needs more than 1GB of RAM. Try 2GB.
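
For orientation, the relevant entries of StanfordCoreNLP-chinese.properties look roughly like the sketch below. This is reconstructed from the flags and log messages above rather than copied from the shipped file, so exact paths and any additional settings may differ by version:

# "segment" is registered as a custom annotator backed by the Chinese segmenter
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
# segmenter model and dictionary, matching the paths printed in the log above
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
# corresponds to the -segment.verbose false flag recommended above
segment.verbose = false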

At this point, all should be good! With the command:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit,pos,ner,parse -segment.verbose false -file input.txt

you get the output in input.txt.xml . (I'm not posting it, since it's a couple of thousand lines long....)
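
If you would rather call the pipeline from Java than from the command line, a minimal sketch looks like the following. It assumes the CoreNLP code jar and the Chinese models jar are both on the classpath, and it loads the same properties file before overriding the annotator list:

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class ChineseSegmentDemo {
  public static void main(String[] args) throws Exception {
    // Load the Chinese defaults that ship inside the models jar.
    Properties props = new Properties();
    props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));

    // Same annotator set and verbosity as the command line above.
    props.setProperty("annotators", "segment, ssplit, pos, ner, parse");
    props.setProperty("segment.verbose", "false");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document =
        new Annotation("連戰夫婦今晚的50週年金婚紀念宴,正值連戰赴陸出席閱兵引發爭議之際。");
    pipeline.annotate(document);

    // Print each segmented word with its part-of-speech tag.
    // (Redirect to a file or make sure the console handles UTF-8,
    // otherwise the Chinese will look garbled just like in the question.)
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t"
            + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
      }
    }
  }
}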

Update for CoreNLP v3.8.0: If using the (current in 2017) CoreNLP v3.8.0, then there have been some changes/progress: (i) we now use the annotator tokenize for all languages, and no custom annotator needs to be loaded for Chinese; (ii) verbose segmentation is now correctly turned off by default; (iii) [negative progress] the dependencies now require the lemma annotator to run before ner, even though it does nothing for Chinese; and (iv) coreference is now available for Chinese as coref, which requires the earlier annotator mention, and its statistical models need quite a bit of memory. Putting that all together, you are now good to go with this command:

java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file input.txt
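
One more note on the Maven setup in the question: the models-chinese classifier jar contains only the model files, so the base CoreNLP artifact has to be declared as well. For v3.8.0 the pair of dependencies would look roughly like this (a sketch; pick the version matching the release you actually use):

<dependency> 
    <groupId>edu.stanford.nlp</groupId> 
    <artifactId>stanford-corenlp</artifactId> 
    <version>3.8.0</version> 
</dependency> 
<dependency> 
    <groupId>edu.stanford.nlp</groupId> 
    <artifactId>stanford-corenlp</artifactId> 
    <version>3.8.0</version> 
    <classifier>models-chinese</classifier> 
</dependency>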
