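The first number is the cluster ID (it identifies the tokens that refer to the same entity); see the source code of SieveCoreferenceSystem#coref(Document).
The pairs of numbers are the output of CorefChain#toString():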
public String toString() {
    return position.toString();
}
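where position is the set of positions of every mention of the entity (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get the tokens from the positions: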
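// Package names as in the dcoref-era CoreNLP releases
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        // Configure the pipeline up to deterministic coreference (dcoref)
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);
        pipeline.annotate(document);
        // Map from cluster ID to the coreference chain of that cluster
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
        println aText
        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // Slice the text with the mention's startIndex and endIndex
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}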
Output (I don't understand where the "s" comes from):
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention: basic unit
Mentions: basic unit |
ClusterId: 8
Representative Mention: unit
Mentions: unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
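P.S. I think the default settings (models) are not suited to your domain. Stanford CoreNLP seems better suited to extracting semantics from news, articles, and similar text. For example, Stanford NER, which is part of CoreNLP, was trained and tested on the CoNLL 2002 and 2003 corpora. – Skarab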
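This answer was partially useful and led me to the right algorithm, but the output shown here is not correct for the sentence: there is no "he" or "s" in the sentence, and "it" maps only to itself, which defeats the point of coreference resolution. – user1084563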
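I think you are treating 'startIndex' and 'endIndex' as if they were character indices (starting from 0), but they are token indices (starting from 1). Also, you never defined 'aText'. Assuming you meant the text of the annotation, instead of "he" (characters 1 and 2) you should get "The atom" (words 1 and 2), and so on. –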
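
To make the difference between character indices and token indices concrete, here is a minimal, self-contained Java sketch. It does not call CoreNLP: `TokenSpanDemo`, `tokenize` and `mentionText` are hypothetical names made up for illustration, the regex tokenizer is only a naive stand-in for the real one, and the index pairs (1, 3) and (10, 11) are inferred from the "he" and "s" in the output above. It also assumes that the end index is exclusive, as the "words 1 and 2" example suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSpanDemo {

    // Naive stand-in tokenizer (an assumption, not CoreNLP's tokenizer):
    // each run of word characters or single punctuation mark is one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Interpret startIndex/endIndex as 1-based token indices, end exclusive
    // (assumed), and join the selected tokens with single spaces.
    static String mentionText(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists of a dense "
                + "central nucleus surrounded by a cloud of negatively charged electrons.";
        List<String> tokens = tokenize(text);

        // Index pair (1, 3): character slice versus token slice
        System.out.println(text.substring(1, 3));        // prints "he"
        System.out.println(mentionText(tokens, 1, 3));   // prints "The atom"

        // Index pair (10, 11): this is where the stray "s" comes from
        System.out.println(text.substring(10, 11));      // prints "s"
        System.out.println(mentionText(tokens, 10, 11)); // prints "it"
    }
}
```

With the real pipeline, the same idea would apply to the tokens of the sentence that contains the mention rather than to a regex split of the raw text.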