Extract a linguistic structure based on POS tags from a sentence using Stanford NLP in Java

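I'm new to natural language processing (NLP). I want to do part-of-speech (POS) tagging and then find a specific structure in the text. I can manage the POS tagging with Stanford NLP, but I don't know how to extract this structure: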

NN/NNS + IN + DT + NN/NNS/NNP/NNPS

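import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PosTagging { // placeholder name; the post did not show the class declaration

    public static void main(String[] args) throws Exception {
        // Input file (path left empty in the post)
        String contentFilePath = "";
        // Output file: input path minus its 4-character extension, plus "_postagg.txt"
        String triplesFilePath = contentFilePath.substring(0, contentFilePath.length() - 4) + "_postagg.txt";

        // Text to POS-tag; getFileContent is the asker's own helper (not shown)
        String content = getFileContent(contentFilePath);

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate the document.
        Annotation doc = new Annotation(content);
        pipeline.annotate(doc);

        // Print every token with its POS tag, sentence by sentence.
        List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(word + "/" + pos);
            }
        }
    }
}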

Source

2017-07-31 Raha1986

I just realized that the POS tag for determiners is "DT", not "DET". I have corrected my answer below, and it works now.

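You can simply iterate over the tokens of each sentence and check their POS tags. Wherever four consecutive tags match your pattern, you can extract the structure. The code might look like this: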

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    // Stop three tokens before the end so that i + 3 is always a valid index.
    for (int i = 0; i < tokens.size() - 3; i++) {
        // getString returns "" instead of null when the annotation is missing
        String pos = tokens.get(i).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (pos.equals("NN") || pos.equals("NNS")) {
            pos = tokens.get(i + 1).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("IN")) {
                pos = tokens.get(i + 2).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("DT")) {
                    pos = tokens.get(i + 3).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // contains("NN") covers NN, NNS, NNP and NNPS
                    if (pos.contains("NN")) {
                        // We have a match starting at index i and ending at index i + 3
                        String word1 = tokens.get(i).getString(CoreAnnotations.TextAnnotation.class);
                        String word2 = tokens.get(i + 1).getString(CoreAnnotations.TextAnnotation.class);
                        String word3 = tokens.get(i + 2).getString(CoreAnnotations.TextAnnotation.class);
                        String word4 = tokens.get(i + 3).getString(CoreAnnotations.TextAnnotation.class);
                        System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                    }
                }
            }
        }
    }
}
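
If the pattern is likely to change, the nested checks above could also be written as a table of allowed tags per position. What follows is only a rough sketch of that idea in plain Java, with no CoreNLP dependency: the class `PosPatternSketch`, the helper `tagSet`, and the method `findMatches` are names made up for this illustration, and the tags in `main` are written by hand rather than produced by the tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
class PosPatternSketch {

    // Builds the set of tags accepted at one position of the pattern.
    static Set<String> tagSet(String... tags) {
        return new HashSet<>(Arrays.asList(tags));
    }

    // One set per position: NN/NNS + IN + DT + NN/NNS/NNP/NNPS
    static final List<Set<String>> PATTERN = Arrays.asList(
            tagSet("NN", "NNS"),
            tagSet("IN"),
            tagSet("DT"),
            tagSet("NN", "NNS", "NNP", "NNPS"));

    // Returns the words of every window of consecutive tokens whose tags fit PATTERN.
    static List<String> findMatches(List<String> words, List<String> tags) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i + PATTERN.size() <= tags.size(); i++) {
            boolean ok = true;
            for (int j = 0; j < PATTERN.size() && ok; j++) {
                ok = PATTERN.get(j).contains(tags.get(i + j));
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(i, i + PATTERN.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-written tags; in practice they would come from PartOfSpeechAnnotation.
        List<String> words = Arrays.asList("The", "members", "of", "the", "committee", "voted");
        List<String> tags = Arrays.asList("DT", "NNS", "IN", "DT", "NN", "VBD");
        System.out.println(findMatches(words, tags)); // prints [members of the committee]
    }
}
```

To use this with the pipeline, you would fill `words` and `tags` from each sentence's tokens and call `findMatches` once per sentence. Changing the structure then only means editing `PATTERN`.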

Source

2017-07-31 12:11:56

Why doesn't it work with DET? – Raha1986

Stanford's POS tags are taken from the [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). That tag set uses "DT" for determiners. There is no "DET" tag.
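
One way to catch this kind of mistake early might be to check every tag name used in a pattern against the Penn Treebank list before matching. The snippet below is a small sketch of that check: `PennTagCheck` and `isPennTag` are hypothetical names, and the set holds only the 36 word-level tags from the page linked above, so the punctuation tags that a tagger can also emit are deliberately left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
class PennTagCheck {

    // The 36 word-level Penn Treebank tags (punctuation tags omitted).
    static final Set<String> PENN_TAGS = new HashSet<>(Arrays.asList(
            "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
            "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
            "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
            "VBZ", "WDT", "WP", "WP$", "WRB"));

    // True if the given name is one of the word-level Penn Treebank tags.
    static boolean isPennTag(String tag) {
        return PENN_TAGS.contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(isPennTag("DT"));  // prints true
        System.out.println(isPennTag("DET")); // prints false
    }
}
```

A pattern containing "DET" would fail this check immediately, instead of silently never matching.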
