2013-12-14
CoreNLP: extracting the spans of tokens

I want to extract the character spans of the tokens in a tokenized String. Using Stanford CoreNLP, I have:

Properties props; 
props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos, lemma"); 
this.pipeline = new StanfordCoreNLP(props); 

String answerText = "This is the answer"; 
ArrayList<IntPair> tokenSpans = new ArrayList<IntPair>(); 
// create an empty Annotation with just the given text 
Annotation document = new Annotation(answerText); 
// run all Annotators on this text 
this.pipeline.annotate(document); 

// Iterate over all of the sentences 
List<CoreMap> sentences = document.get(SentencesAnnotation.class); 
for(CoreMap sentence: sentences) { 
    // Iterate over all tokens in a sentence 
    for (CoreLabel fullToken: sentence.get(TokensAnnotation.class)) { 
        IntPair span = fullToken.get(SpanAnnotation.class);
        tokenSpans.add(span);
    } 
} 

However, all of the IntPairs are null. Do I need to add another annotator to this line?

props.put("annotators", "tokenize, ssplit, pos, lemma"); 

Desired output:

(0,3), (5,6), (8,10), (12,17) 

Answer

The problem is the use of SpanAnnotation, which applies to Trees. The correct classes for this query are CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.
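
In the loop from the question, this would presumably mean reading fullToken.get(CharacterOffsetBeginAnnotation.class) and fullToken.get(CharacterOffsetEndAnnotation.class) instead of SpanAnnotation; no extra annotator should be needed, since the tokenizer sets both values. Note that the end offset points one past the last character of the token, so it has to be reduced by 1 to match the inclusive pairs in the desired output. The sketch below illustrates only that offset arithmetic in plain Java. The class TokenSpans, its methods, and the whitespace tokenizer are hypothetical stand-ins, not part of CoreNLP, and CoreNLP's own tokenizer will split punctuation differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of CoreNLP.
final class TokenSpans {
    // Stand-in tokenizer: runs of non-whitespace characters (an assumption,
    // much simpler than CoreNLP's tokenizer).
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Returns {begin, endInclusive} for every token in the text.
    static List<int[]> inclusiveSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // With CoreNLP these two values would come from
            // CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.
            int begin = m.start();
            int endExclusive = m.end(); // one past the last character
            spans.add(new int[] {begin, endExclusive - 1});
        }
        return spans;
    }

    // Formats the spans as "(b,e), (b,e), ...".
    static String format(List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] span : spans) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append('(').append(span[0]).append(',').append(span[1]).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints (0,3), (5,6), (8,10), (12,17)
        System.out.println(format(inclusiveSpans("This is the answer")));
    }
}
```

If half-open spans are acceptable, the subtraction can be dropped and the begin and end offsets stored as they are.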