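Extract noun words and the original sentence from POS tags

I want to extract the nouns from a sentence, and also get the original sentence back from the POS-tagged text.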

//Extract the words before _NNP & _NN from below and also how to get back the original sentence from the Pos TAG. 
Original Sentence: Hi. How are you? This is Mike.
POSTag: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN

This is what I have tried so far:

String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN"; 


    String re1 = "((?:[a-z][a-z0-9_]*))"; // Variable Name 1 
    String re2 = ".*?"; // Non-greedy match on filler 
    String re3 = "(_)"; // Any Single Character 1 
    String re4 = "(NNP)"; // Word 1 

    Pattern p = Pattern.compile(re1 + re2 + re3 + re4, Pattern.CASE_INSENSITIVE | Pattern.DOTALL); 
    Matcher m = p.matcher(txt); 
    if (m.find()) { 
     String var1 = m.group(1); 
     System.out.print( var1.toString() ); 
    } 
}

Output: Hi, but I need all the nouns in a list.


2013-09-30 srp

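Have you tried anything yet? '[a-zA-Z]+(?=[.]_NN)' will capture any alphabetic string followed by '._NN'; maybe you can start from there. – sp00m

Thanks for your reply. – srp

There is a typo in your example. In the first block, "Mike." is followed by "_NN", but in the second block it is followed by "_NNP". –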

To extract the nouns, you can do this:

public static String[] extractNouns(String sentenceWithTags) {
    // Split the String into an array of Strings at every tag that starts with "_NN"
    // followed by zero, one or two more word characters (like "_NNP", "_NNPS", or "_NNS").
    // Note: text after the last noun tag, if any, ends up as an extra trailing element.
    String[] nouns = sentenceWithTags.split("_NN\\w?\\w?\\b");
    // Keep only the last word (which is the noun) of every String in the array
    for (int index = 0; index < nouns.length; index++) {
        nouns[index] = nouns[index].substring(nouns[index].lastIndexOf(" ") + 1)
                // Remove all non-word characters from the extracted nouns
                .replaceAll("[^\\p{L}\\p{Nd}]", "");
    }
    return nouns;
}

To get back the original sentence, you can do this:

public static String extractOriginal(String sentenceWithTags) { 
    return sentenceWithTags.replaceAll("_([A-Z]*)\\b", ""); 
}

Proof that it works:

public static void main(String[] args) { 
    String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN"; 
    System.out.println(java.util.Arrays.toString(extractNouns(sentence))); 
    System.out.println(extractOriginal(sentence)); 
}

Output:

[Hi, Mike] 
Hi. How are you? This is Mike.

Note: for the regular expression that removes all non-word characters (such as punctuation) from the extracted nouns, I used this Stack Overflow question/answer.
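The split-based extractNouns above appears to rely on the tagged sentence ending with a noun tag; if other tokens follow the last noun, the leftover text becomes an extra element. A minimal sketch of an alternative that collects matches with a Matcher instead of splitting is shown here. The class name NounMatcherSketch and the exact pattern are assumptions for illustration, not part of the answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NounMatcherSketch {

    // Group 1 is the token text in front of a noun tag.
    // NN(?:S|PS?)? accepts exactly NN, NNS, NNP and NNPS;
    // (?!\S) requires the tag to end at whitespace or end of input.
    private static final Pattern NOUN =
            Pattern.compile("(\\S+?)_NN(?:S|PS?)?(?!\\S)");

    public static List<String> extractNouns(String sentenceWithTags) {
        List<String> nouns = new ArrayList<>();
        Matcher m = NOUN.matcher(sentenceWithTags);
        while (m.find()) {
            // Same cleanup as in the answer: keep only letters and digits
            nouns.add(m.group(1).replaceAll("[^\\p{L}\\p{Nd}]", ""));
        }
        return nouns;
    }

    public static void main(String[] args) {
        // prints [Hi, Mike]
        System.out.println(extractNouns(
                "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN"));
        // prints [Obama, president]; the trailing this_VBZ is ignored
        System.out.println(extractNouns(
                "Obama_NNP is_VBZ a_DT president_NN this_VBZ"));
    }
}
```

Returning a List also avoids having to know the number of nouns in advance.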


2013-09-30 14:29:07

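Thanks for your reply James, it works perfectly for the noun tag NN, but I need all of these: NN (noun, singular or mass), NNP (proper noun, singular), NNPS (proper noun, plural), NNS (noun, plural), apart from the typo in the sentence. – srp

I fixed the typo, but how do I split on the different noun tags? Currently it only splits on nouns ending in _NN, and I need to extract the nouns tagged _NNP, _NNPS and _NNS as well. – srp

@srp Oh, OK. In that case, just add "\\w?\\w?" to the regex "_NN\\b", right after "_NN". "\\w" matches a word character and "?" means zero or one occurrence, so this matches "_NN" followed by zero, one or two word characters. Answer updated. –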

Use while (m.find()) instead of if (m.find()) to iterate over all matches.

Also, your regex can really be simplified:

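If you don't need to capture data, simply don't use parentheses (in general).

((?:...))

This is rather odd: a non-capturing group nested directly inside a capturing group serves no purpose.

I'm not sure the .*? part does what you expect. If you want to match a literal dot, use [.] instead.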

So, try ([a-z][a-z0-9_]*)[.]_NNP.

Or even use a positive lookahead: [a-z][a-z0-9_]*(?=[.]_NNP). Use m.group() to access the matched text.
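Putting the loop and the lookahead together, a minimal sketch could look like the following. The class and method names are made up for illustration, and the tag in the lookahead is shortened to _NN so that _NN, _NNP, _NNS and _NNPS are all accepted; that shortening is an assumption, not part of the suggestion above. The pattern still expects a dot between the word and the tag, as in the sample text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllSketch {

    public static List<String> findNouns(String txt) {
        // CASE_INSENSITIVE is kept from the question so that [a-z] also matches "Hi"
        Pattern p = Pattern.compile("[a-z][a-z0-9_]*(?=[.]_NN)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(txt);
        List<String> found = new ArrayList<>();
        while (m.find()) {         // keeps going until there is no further match
            found.add(m.group());  // whole match; the lookahead text is not included
        }
        return found;
    }

    public static void main(String[] args) {
        String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(findNouns(txt)); // prints [Hi, Mike]
    }
}
```

With if (m.find()) only the first element, Hi, would be reported, which matches the output described in the question.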


2013-09-30 14:18:58 sp00m

Thanks for your reply. – srp

This one should work:

import java.util.ArrayList;

public class Test {

    // A token counts as a noun if it is letters followed by _NN, _NNS, _NNP or _NNPS
    public static final String NOUN_REGEX = "[a-zA-Z]*_NN\\w?\\w?\\b";

    public static ArrayList<String> extractNounsByRegex(String sentenceWithTags) {
        ArrayList<String> nouns = new ArrayList<String>();
        // Tokens are separated by whitespace
        String[] words = sentenceWithTags.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].matches(NOUN_REGEX)) {
                // Remove the suffix _NN* and retain [a-zA-Z]*
                nouns.add(words[i].replaceAll("_NN\\w?\\w?\\b", ""));
            }
        }
        return nouns;
    }

    public static String extractOriginal(String sentenceWithTags) {
        // Strip every tag, not only the noun tags
        return sentenceWithTags.replaceAll("_[A-Z]+\\b", "");
    }

    public static void main(String[] args) {
        // Tokens with punctuation such as "Hi._NNP" do not match NOUN_REGEX:
        // String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        String sentence = "Eiffel_NNP tower_NN is_VBZ in_IN paris_NN Hi_NNP How_WRB are_VBP you_PRP This_DT is_VBZ Mike_NNP Barrack_NNP Obama_NNP is_VBZ a_DT president_NN this_VBZ";
        System.out.println(extractNounsByRegex(sentence).toString());
        System.out.println(extractOriginal(sentence));
    }
}
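The whitespace-split approach above can also be applied to tokens that carry punctuation, such as "Hi._NNP", by cutting each token at its last underscore instead of matching the whole token against a regex. The following is only a sketch under that assumption; TokenSplitSketch and splitToken are hypothetical names, and treating every tag that starts with "NN" as a noun assumes Penn Treebank style tags:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSplitSketch {

    // Hypothetical helper: returns {word, tag} for a token such as "Mike._NN"
    static String[] splitToken(String token) {
        int idx = token.lastIndexOf('_');
        if (idx < 0) {
            return new String[] { token, "" }; // token carries no tag
        }
        return new String[] { token.substring(0, idx), token.substring(idx + 1) };
    }

    public static List<String> extractNouns(String sentenceWithTags) {
        List<String> nouns = new ArrayList<>();
        for (String token : sentenceWithTags.trim().split("\\s+")) {
            String[] parts = splitToken(token);
            if (parts[1].startsWith("NN")) { // NN, NNS, NNP, NNPS
                // Drop punctuation attached to the word, keep letters and digits
                nouns.add(parts[0].replaceAll("[^\\p{L}\\p{Nd}]", ""));
            }
        }
        return nouns;
    }

    public static String extractOriginal(String sentenceWithTags) {
        StringBuilder sb = new StringBuilder();
        for (String token : sentenceWithTags.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(splitToken(token)[0]); // the word without its tag
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(extractNouns(sentence));    // prints [Hi, Mike]
        System.out.println(extractOriginal(sentence)); // prints Hi. How are you? This is Mike.
    }
}
```

Because the tag is taken as everything after the last underscore, tags that contain non-letter characters, such as PRP$, are removed as well.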


2013-10-02 11:27:04 pshirishreddy

Thanks for your reply. – srp
