Stanford CoreNLP - understanding coreference resolution

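I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation: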

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons. 

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I'm not sure I understand the meaning of these numbers. Looking at the source code doesn't really help either.

Thank you

2011-07-04 pnsilva

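The first number is a cluster ID (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():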

public String toString(){ 
    return position.toString(); 
}

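where position is a set of position pairs for the entity's mentions (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens: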

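// Package names as in the dcoref-era CoreNLP releases; adjust them for your version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // aText holds the raw input text so it can be printed and sliced below.
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // startIndex/endIndex are used here as offsets into the raw text.
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}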

Output (I don't understand where the "s" comes from):

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons. 
ClusterId: 1 
Representative Mention: he 
Mentions: he|atom |s| 
ClusterId: 6 
Representative Mention: basic unit 
Mentions: basic unit | 
ClusterId: 8 
Representative Mention: unit 
Mentions: unit | 
ClusterId: 10 
Representative Mention: it 
Mentions: it |

2011-07-06 12:42:35 Skarab

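P.S. I think the default settings (models) are not suited to your domain. Stanford CoreNLP seems better suited to extracting semantics from news, articles, etc. For example, Stanford NER (a part of CoreNLP) was trained and tested on the CoNLL 2002 and 2003 corpora. – Skarab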

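This algorithm was partially useful and led me to the right one, but the output shown here is not correct for the sentence: there is no "he" or "s" in the sentence, and "it" maps only to itself, which defeats the point of coreference resolution. – user1084563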

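I think you are treating 'startIndex' and 'endIndex' as if they were character indices (starting from 0), but they are token indices (starting from 1). Also, you did not define 'aText'. Assuming you meant the text in the annotation, instead of "he" (characters 1 and 2) you should get "The atom" (words 1 and 2), etc. –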
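To make the difference described in the comment above concrete, here is a minimal, self-contained sketch that does not use CoreNLP at all. The class and both helper methods are hypothetical, the tokens come from a plain whitespace split rather than the real tokenizer, and the span (1, 3) for the first mention is an assumption inferred from the "he" output and the comment above.

```java
import java.util.Arrays;
import java.util.List;

class TokenSpanDemo {
    // Hypothetical helper: reads (startIndex, endIndex) as a 1-based token span
    // with an exclusive end, and joins the covered tokens with spaces.
    static String tokenSpan(List<String> tokens, int startIndex, int endIndex) {
        StringBuilder sb = new StringBuilder();
        for (int i = startIndex - 1; i < endIndex - 1; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens.get(i));
        }
        return sb.toString();
    }

    // The mistaken reading: the same two numbers used as 0-based character offsets.
    static String charSpan(String text, int startIndex, int endIndex) {
        return text.substring(startIndex, endIndex);
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter";
        // Whitespace split is only a stand-in for the real tokenizer.
        List<String> tokens = Arrays.asList(text.split(" "));

        System.out.println(charSpan(text, 1, 3));    // prints "he"
        System.out.println(tokenSpan(tokens, 1, 3)); // prints "The atom"
    }
}
```

With the real API the tokens would come from the sentence's token list rather than from a whitespace split, but the index arithmetic would be the same.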

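I've been working with the coreference dependency graph, and I started out by using the other answer to this question. After a while, though, I realized that the algorithm above is not exactly correct. The output it produces is not even close to that of the modified version I have.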

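For anyone else who uses this post, here is the algorithm I ended up with. It also filters out self references, because every representative mention also mentions itself and a lot of mentions only reference themselves.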

// Assumes `document` is the Annotation that has already been run through the pipeline.
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    // Skip singleton chains: they only contain a self reference, which isn't that useful.
    if (c.getCorefMentions().size() <= 1)
        continue;

    // Rebuild the representative mention's text from its tokens.
    // sentNum, startIndex and endIndex are 1-based; endIndex is exclusive.
    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum - 1).get(TokensAnnotation.class);
    for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for (CorefMention m : c.getCorefMentions()) {
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1).get(TokensAnnotation.class);
        for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();

        // Don't print the self mention (compared by text).
        if (clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}

And the final output for your example sentence is as follows:

representative mention: "a basic unit of matter" is mentioned by:
	The atom
	it

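Usually "the atom" ends up being the representative mention, but surprisingly it doesn't in this case. Another example, with slightly more accurate output, is the following sentence:

The Revolutionary War occurred during the 1700s and it was the first war in the United States.

It produces the following output: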

representative mention: "The Revolutionary War" is mentioned by:
	it
	the first war in the United States

2011-12-16 13:43:58 user1084563

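These are recent results from the annotator:

[1, 1] 1 The atom
[1, 2] 1 a basic unit of matter
[1, 3] 1 it
[1, 6] 6 negatively charged electrons
[1, 5] 5 a cloud of negatively charged electrons

The notation is as follows: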

[Sentence number,'id'] Cluster_no Text_Associated

Texts that belong to the same cluster refer to the same context.
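As a rough illustration of how those rows can be read, the sketch below groups them by cluster number. It is plain Java with no CoreNLP dependency: the Mention class and groupByCluster method are hypothetical names made up for this example, and the rows are simply the five results listed above typed in by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClusterGroupingDemo {
    // Hypothetical record of one row: [sentence, id] cluster text.
    static class Mention {
        final int sentence;
        final int id;
        final int cluster;
        final String text;

        Mention(int sentence, int id, int cluster, String text) {
            this.sentence = sentence;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    // Collects mention texts per cluster number, keeping first-seen order.
    static Map<Integer, List<String>> groupByCluster(List<Mention> mentions) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        for (Mention m : mentions) {
            clusters.computeIfAbsent(m.cluster, k -> new ArrayList<>()).add(m.text);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Mention> mentions = Arrays.asList(
            new Mention(1, 1, 1, "The atom"),
            new Mention(1, 2, 1, "a basic unit of matter"),
            new Mention(1, 3, 1, "it"),
            new Mention(1, 6, 6, "negatively charged electrons"),
            new Mention(1, 5, 5, "a cloud of negatively charged electrons"));

        // Cluster 1 ends up with three mentions; clusters 6 and 5 have one each.
        System.out.println(groupByCluster(mentions));
    }
}
```

Clusters with a single entry (6 and 5 here) are the self references that the earlier answer filters out.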

2017-07-18 07:00:50 Purvanshi
