Extracting the leaf nodes of a phrase structure tree with a regular expression

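I want to use a regular expression in Java to extract the leaf nodes from the phrase structure tree of a sentence. For example, take the sentence "This is an easy sentence."

I already have its parse:

Input: (ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT an) (JJ easy) (NN sentence))) (. .)))

I want to extract the leaf nodes with a regular expression.

Output: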

DT This 
VBZ is 
DT an 
JJ easy 
NN sentence 
. .

Source

2013-02-23 Just life

http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html – 2013-02-23 18:15:56

If the leaves themselves contain no nested parentheses, you can use this:

(?<=\()[^()]+(?=\))

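See it here on Regexr.

(?<=\() is a lookbehind assertion; it ensures that a "(" comes immediately before the match.

(?=\)) is a lookahead assertion; it ensures that a ")" comes immediately after the match.

[^()]+ is a negated character class; it matches one or more characters of any kind except parentheses.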
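As a minimal sketch of how this expression could be applied in Java: the class name `LookaroundLeaves` and the method `extract` are made up for illustration and are not part of the answer. Because the parentheses are only asserted by the lookarounds and never consumed, each match is the bare "TAG word" text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class LookaroundLeaves {
    // Same expression as above; backslashes are doubled for the Java string literal.
    private static final Pattern LEAF = Pattern.compile("(?<=\\()[^()]+(?=\\))");

    /** Returns every innermost "TAG word" span, in order of appearance. */
    static List<String> extract(String tree) {
        List<String> leaves = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            // The lookarounds do not consume the parentheses,
            // so group() holds only the text between them.
            leaves.add(m.group());
        }
        return leaves;
    }

    public static void main(String[] args) {
        String input = "(ROOT (S (NP (DT This)) (VP (VBZ is) "
                + "(NP (DT an) (JJ easy) (NN sentence))) (. .)))";
        for (String leaf : extract(input)) {
            System.out.println(leaf);
        }
    }
}
```

Inner nodes such as `(NP (DT an) ...)` are skipped because the text after their opening parenthesis runs into another "(" before it reaches a ")".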

Source

2013-02-23 18:47:39 stema

This is very useful! Thanks, stema! – 2013-02-23 19:05:20

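The regular expression you need is \(([^ ]+) +([^()]+)\)

It will match:
\( an opening parenthesis,
([^ ]+) then one or more non-space characters (captured as group #1),
 + then one or more spaces,
([^()]+) then one or more characters other than parentheses (captured as group #2),
\) and finally a closing parenthesis.

To use it in Java, precompile it as a Pattern in your class: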

static final Pattern leaf = Pattern.compile("\\(([^ ]+) +([^()]+)\\)");

Then create a Matcher for each input string and loop over the matches with its find method:

Matcher m = leaf.matcher(input); 
while (m.find()) { 
    // here do something with each leaf, 
    // where m.group(1) is the node type (DT, VBZ...) 
    // and m.group(2) is the word 
}
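Putting the two snippets together, a self-contained sketch might look like the following. The class name `LeafPairs`, the method `extract`, and the choice of a `String[]` pair for each leaf are assumptions made for illustration; only the pattern and the find loop come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, named here only for illustration.
final class LeafPairs {
    // Group 1: the node type (DT, VBZ, ...); group 2: the word.
    static final Pattern LEAF = Pattern.compile("\\(([^ ]+) +([^()]+)\\)");

    /** Returns one {tag, word} pair per leaf, in order of appearance. */
    static List<String[]> extract(String input) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = LEAF.matcher(input);
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "(ROOT (S (NP (DT This)) (VP (VBZ is) "
                + "(NP (DT an) (JJ easy) (NN sentence))) (. .)))";
        for (String[] pair : extract(input)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

Compared with the lookaround version, the capture groups hand back the tag and the word separately, so no further splitting is needed.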

Source

2013-02-23 18:50:49 Tobia

Thanks for your help, Tobia. This is very helpful! – 2013-02-23 19:04:30

Assuming you are using Stanford NLP, judging by the tags on this question:

A simpler approach is to use getLeaves(), a built-in method of the Tree class.

Source

2014-02-04 23:57:13
