Setting options in the Stanford CoreNLP tokenizer

I modified Prof. Manning's code example from here to read in a file, tokenize it, part-of-speech tag it, and lemmatize it.

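Now I have run into a problem with untokenizable characters, and I would like to use the "untokenizable" option and set it to "noneKeep".

Other questions on StackOverflow explain that I would need to instantiate the tokenizer myself. However, I am not sure how to do that so that the subsequent tasks (POS tagging etc.) are still performed as needed. Could anyone point me in the right direction?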

// Expects two command-line parameters: the file to read and the file to write to.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordCoreNlpDemo {

    public static void main(String[] args) throws IOException {
        // Configure the annotators: tokenization, sentence splitting, POS tagging, lemmatization.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Read the whole input file into a single annotation and run the pipeline on it.
        Annotation annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        pipeline.annotate(annotation);

        // try-with-resources closes the writer, so buffered output reaches the file.
        try (PrintWriter out = new PrintWriter(args[1])) {
            pipeline.prettyPrint(annotation, out);
        }
    }
}


2017-05-12 user1769925

Add this to your code:

props.setProperty("tokenize.options", "untokenizable=allKeep");

The six possible values for untokenizable are:

noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep
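
As far as I understand the PTBTokenizer documentation, the first half of each name controls whether a warning is logged for none, the first, or all untokenizable characters, and the second half controls whether those characters are deleted or kept as single-character tokens. The sketch below shows one way to set the value the question asks for ("noneKeep") while guarding against typos. The class `TokenizerOptions` and its `build` method are hypothetical helpers invented for illustration, not part of CoreNLP; only the property names and the six values come from the text above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): builds pipeline properties with a
// validated value for the tokenizer's "untokenizable" option.
final class TokenizerOptions {

    // The six values listed in the answer above.
    static final List<String> UNTOKENIZABLE_VALUES = Arrays.asList(
            "noneDelete", "firstDelete", "allDelete",
            "noneKeep", "firstKeep", "allKeep");

    static Properties build(String untokenizable) {
        // Reject anything that is not one of the six known values.
        if (!UNTOKENIZABLE_VALUES.contains(untokenizable)) {
            throw new IllegalArgumentException(
                    "Unknown untokenizable value: " + untokenizable);
        }
        Properties props = new Properties();
        // Same annotators as in the question.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        props.setProperty("tokenize.options", "untokenizable=" + untokenizable);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("noneKeep");
        System.out.println(props.getProperty("tokenize.options"));
        // With CoreNLP on the classpath, the properties would be passed on
        // exactly as in the question:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

The rest of the program from the question stays unchanged: the pipeline builds its own tokenizer from these properties, so POS tagging and lemmatization still run as before.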


2017-05-12 22:47:13 StanfordNLPHelp

That is almost disappointingly convenient. Thank you very much! – user1769925
