2017-07-31

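Extracting a linguistic structure from POS-tagged sentences using Stanford NLP in Java

I am new to natural language processing (NLP). I want to do part-of-speech (POS) tagging and then find a specific structure within the text. I can manage the POS tagging with Stanford NLP, but I do not know how to extract this structure: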

NN/NNS + IN + DT + NN/NNS/NNP/NNPS

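import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Properties;

public class PosTagging { // class name is a placeholder; the post does not show it
    public static void main(String[] args) throws Exception {
        // Input file (left empty in the post). It must be set before running:
        // with an empty path the substring call below throws an exception.
        String contentFilePath = "";
        // Output file: the input path with its 4-character extension (e.g. ".txt")
        // replaced by "_postagg.txt". Nothing is written to it in this snippet.
        String triplesFilePath = contentFilePath.substring(0, contentFilePath.length() - 4) + "_postagg.txt";

        // Text to POS-tag. getFileContent is the asker's own helper (not shown in the post).
        String content = getFileContent(contentFilePath);

        // Build a pipeline that tokenizes, splits sentences and assigns POS tags.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate the document.
        Annotation doc = new Annotation(content);
        pipeline.annotate(doc);

        // Print every token together with its POS tag.
        List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(word + "/" + pos);
            }
        }
    }
}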

I just realized that the POS tag for determiners is "DT", not "DET". I have corrected my answer below, and it works now.

Answer


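You can simply iterate over the tokens of each sentence and check their POS tags. Wherever four consecutive tags satisfy your pattern, you can extract the corresponding words. The code could look like this: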

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    // Stop three tokens before the end so that index i + 3 is always valid.
    for (int i = 0; i < tokens.size() - 3; i++) {
        String pos = tokens.get(i).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (pos.equals("NN") || pos.equals("NNS")) {
            pos = tokens.get(i + 1).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("IN")) {
                pos = tokens.get(i + 2).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("DT")) {
                    pos = tokens.get(i + 3).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // contains("NN") accepts NN, NNS, NNP and NNPS
                    if (pos.contains("NN")) {
                        // We have a match starting at index i and ending at index i + 3
                        String word1 = tokens.get(i).getString(CoreAnnotations.TextAnnotation.class);
                        String word2 = tokens.get(i + 1).getString(CoreAnnotations.TextAnnotation.class);
                        String word3 = tokens.get(i + 2).getString(CoreAnnotations.TextAnnotation.class);
                        String word4 = tokens.get(i + 3).getString(CoreAnnotations.TextAnnotation.class);
                        System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                    }
                }
            }
        }
    }
}
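
If the pattern is likely to change, the sliding-window check can also be written against a table of allowed tags per position instead of nested if statements. The following is only a sketch under some assumptions: PosPatternMatcher and findMatches are hypothetical names, not part of Stanford CoreNLP, and the words and tags are assumed to have already been copied out of the CoreLabel tokens into two parallel lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a CoreNLP class): matches a fixed-length POS
// pattern against parallel lists of words and tags.
class PosPatternMatcher {

    // One entry per pattern position, listing the tags accepted there:
    // NN/NNS + IN + DT + NN/NNS/NNP/NNPS
    static final List<List<String>> PATTERN = Arrays.asList(
            Arrays.asList("NN", "NNS"),
            Arrays.asList("IN"),
            Arrays.asList("DT"),
            Arrays.asList("NN", "NNS", "NNP", "NNPS"));

    // Returns every run of words whose tags match PATTERN, joined by spaces.
    static List<String> findMatches(List<String> words, List<String> tags) {
        List<String> matches = new ArrayList<>();
        // Only start where the whole pattern still fits into the sentence.
        for (int i = 0; i + PATTERN.size() <= tags.size(); i++) {
            boolean ok = true;
            for (int j = 0; j < PATTERN.size() && ok; j++) {
                ok = PATTERN.get(j).contains(tags.get(i + j));
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(i, i + PATTERN.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-tagged example sentence; in practice the tags come from the pipeline.
        List<String> words = Arrays.asList("The", "members", "of", "the", "committee", "met");
        List<String> tags = Arrays.asList("DT", "NNS", "IN", "DT", "NN", "VBD");
        System.out.println(findMatches(words, tags)); // prints [members of the committee]
    }
}
```

Because the pattern is data rather than control flow, adding a position or another accepted tag only means editing PATTERN.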

Why doesn't it work with DET? – Raha1986


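The Stanford POS tags are taken from the [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). That tag set uses "DT" for determiners. There is no "DET" tag, so a comparison against "DET" never matches.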
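
A misspelled tag such as "DET" fails silently, because the comparison is simply never true. One way to catch this early is to check every tag used in a pattern against the known tag set before matching. The sketch below is illustrative only: PennTagCheck and isKnownTag are hypothetical names, and the set contains the 36 word-level Penn Treebank tags without punctuation tags, so a given tagger model may emit tags that are not listed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a CoreNLP class): checks hand-written pattern tags
// against the word-level Penn Treebank tag set.
class PennTagCheck {

    // The 36 word-level Penn Treebank tags; punctuation tags are omitted.
    static final Set<String> PENN_TAGS = new HashSet<>(Arrays.asList(
            "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
            "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS",
            "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
            "WDT", "WP", "WP$", "WRB"));

    // True if the tag belongs to the set above.
    static boolean isKnownTag(String tag) {
        return PENN_TAGS.contains(tag);
    }

    public static void main(String[] args) {
        // Tags as they might be written in a pattern, including the "DET" typo.
        for (String tag : Arrays.asList("NN", "IN", "DET", "DT")) {
            if (!isKnownTag(tag)) {
                System.out.println("Unknown POS tag in pattern: " + tag);
            }
        }
    }
}
```

Run as is, the example reports only "DET" as unknown.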