Get NLTK tree leaf values as strings

I am trying to get the leaf values of a Tree object as a string. The Tree object here is the output of the Stanford Parser.

Here is my code:

from nltk.parse import stanford 
Parser = stanford.StanfordParser("path") 


example = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" 
sentences = Parser.raw_parse(example) 
for line in sentences: 
    for sentence in line: 
        tree = sentence 

This is how I extract the VP (verb phrase) leaves:

VP=[] 

VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP')) 

for i in VP_tree: 
    VP.append(' '.join(i.flatten())) 

Here is what i.flatten() looks like (it returns the list of parsed words):

(VP 
    constructed 
    logistic 
    regression 
    , 
    calibrated 
    the 
    low 
    defaults 
    portfolio 
    to 
    benchmark 
    ratings) 

Because I can only get the leaves as a list of tokens, I join them with ' '. As a result, there is a space between 'regression' and ','.

In [33]: VP 
Out [33]: [u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings'] 

I would like to get the verb phrase as a string (not as a list of tokens), without having to join it with ' '.

I have looked at the methods under the Tree class (http://www.nltk.org/_modules/nltk/tree.html), but so far no luck.

Answers


To retrieve the string as per the positions in the input, you should consider using https://github.com/smilli/py-corenlp instead of the NLTK API to the Stanford tools.

First, you have to download, install, and set up Stanford CoreNLP; see http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started

Then install the Python wrapper for CoreNLP, https://github.com/smilli/py-corenlp

Then, start the server (many people forget this step!), and in Python you can do:

>>> from pycorenlp import StanfordCoreNLP 
>>> stanford = StanfordCoreNLP('http://localhost:9000') 
>>> text = ("Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back") 
>>> output = stanford.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'}) 
>>> print(output['sentences'][0]['parse']) 
(ROOT 
    (SINV 
    (VP (VBN Selected) 
     (NP (NNS variables)) 
     (PP (IN by) 
     (NP 
      (NP (JJ univariate/multivariate) (NN analysis)) 
      (, ,) 
      (VP (VBN constructed) 
      (NP (JJ logistic) (NN regression))) 
      (, ,)))) 
    (VP (VBD calibrated)) 
    (NP 
     (NP 
     (NP (DT the) (JJ low) (NNS defaults) (NN portfolio)) 
     (PP (TO to) 
      (NP (JJ benchmark) (NNS ratings)))) 
     (, ,) 
     (VP (VBN performed) 
     (ADVP (RB back)))))) 
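
As an aside, a minimal sketch of mine (not part of the original answer): the bracketed parse string in the JSON can be read back into an NLTK Tree with Tree.fromstring(), so the subtree filter from the question still works on the CoreNLP output:

>>> from nltk.tree import Tree 
>>> # Load the CoreNLP bracketed parse back into an NLTK Tree. 
>>> nltk_tree = Tree.fromstring(output['sentences'][0]['parse']) 
>>> # The same subtree filter from the question works on it. 
>>> vps = [' '.join(vp.leaves()) for vp in nltk_tree.subtrees(filter=lambda x: x.label() == 'VP')] 

Note that this still yields the token-joined form (with the stray space before the comma); recovering the exact input spacing needs the character offsets, as described next.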

To retrieve the VP strings as they appear in the input string, you would have to traverse the JSON output using characterOffsetBegin and characterOffsetEnd:

>>> output['sentences'][0] 
{u'tokens': [{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, 
u'originalText': u'back', u'before': u' '}], u'index': 0, u'basic-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'parse': u'(ROOT\n (SINV\n (VP (VBN Selected)\n  (NP (NNS variables))\n  (PP (IN by)\n  (NP\n   (NP (JJ univariate/multivariate) (NN analysis))\n   (, ,)\n   (VP (VBN constructed)\n   (NP (JJ logistic) (NN regression)))\n   (, ,))))\n (VP (VBD calibrated))\n (NP\n  (NP\n  (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))\n  (PP (TO to)\n   (NP (JJ benchmark) (NNS ratings))))\n  (, ,)\n  (VP (VBN performed)\n  (ADVP (RB back))))))', u'collapsed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': 
u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'collapsed-ccprocessed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', 
u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}]} 
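
For illustration, a minimal sketch of mine (the token span, 7 through 18, i.e. 'constructed' through 'ratings', is an assumption for the example): once you know which tokens you want, the offsets give you the exact substring of the input:

>>> # Keep the tokens whose 1-based 'index' falls in the span. 
>>> span = [t for t in output['sentences'][0]['tokens'] if 7 <= t['index'] <= 18] 
>>> # Slice the original input with the span's character offsets. 
>>> text[span[0]['characterOffsetBegin']:span[-1]['characterOffsetEnd']] 
u'constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings' 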

But it doesn't seem easy to get the character offsets from the parse, since the parse tree is not directly linked to the offsets. Only the dependency triples contain the word indices that link back to the offsets.


To access the tokens and their 'after' and 'before' keys, use output['sentences'][0]['tokens'] (sadly, the tokens are not directly linked to the parse tree):

>>> tokens = output['sentences'][0]['tokens'] 
>>> tokens 
[{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', 
u'before': u' '}] 
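
As the comment below suggests, the 'before'/'after' fields record the original whitespace around each token, so a correctly spaced string can be rebuilt without offsets. A minimal sketch of mine (the slice covering tokens 7 through 18 is assumed for illustration):

>>> # Join each token's original text with the whitespace that 
>>> # followed it; drop the trailing whitespace of the last token. 
>>> def rebuild(span): 
...     return ''.join(t['originalText'] + t['after'] for t in span[:-1]) + span[-1]['originalText'] 
... 
>>> rebuild(tokens[6:18])  # 0-based list slice covering tokens 7-18 
u'constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings' 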

Since the OP wants a correctly spaced string, rather than trying to use the offsets you can use the 'before' and 'after' attributes to recover the whitespace between tokens. You can see that the comma has '' as its 'before', and that 'analysis' has '' as its 'after'. –


Is it possible to link the token indices to the JSON's 'parse' key? That would help a lot, since we could then easily access a token's index from the parse output. Or would it be easier to parse the sub-JSON under the 'collapsed-dependencies'/'basic-dependencies' keys? – alvas


In short:

Use the Tree.leaves() function to access the strings of the subtrees in the parsed sentence, i.e.:

VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))] 

There is no proper way to access the real VP strings as they appear in the input, because the Stanford Parser tokenizes the text before parsing, and the string offsets are not kept by the NLTK API =(


In long:

This long answer is here so that other NLTK users can get to the point of accessing the Tree object produced by the Stanford Parser through the NLTK API; it might not be as trivial as the question presents =)

First, set up the environment variables so that NLTK can access the Stanford tools:

TL;DR:

$ cd 
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip 
$ unzip stanford-parser-full-2015-12-09.zip 
$ export STANFORDTOOLSDIR=$HOME 
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar 

Then apply the hack for the 2015-12-09 compiled Stanford Parser (this trick will become obsolete with the bleeding-edge versions after https://github.com/nltk/nltk/pull/1280/files):

>>> from nltk.internals import find_jars_within_path 
>>> from nltk.parse.stanford import StanfordParser 
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz") 
>>> stanford_dir = parser._classpath[0].rpartition('/')[0] 
>>> parser._classpath = tuple(find_jars_within_path(stanford_dir)) 

Now on to extracting the phrases.

First, we parse the sentence:

>>> sent = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" 
>>> parsed_sent = list(parser.raw_parse(sent))[0] 
>>> parsed_sent 
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('JJ', ['Selected']), Tree('NNS', ['variables'])]), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('JJ', ['univariate/multivariate']), Tree('NN', ['analysis'])])]), Tree(',', [',']), Tree('VP', [Tree('VBN', ['constructed']), Tree('NP', [Tree('NP', [Tree('JJ', ['logistic']), Tree('NN', ['regression'])]), Tree(',', [',']), Tree('ADJP', [Tree('VBN', ['calibrated']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['low']), Tree('NNS', ['defaults']), Tree('NN', ['portfolio'])]), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('JJ', ['benchmark']), Tree('NNS', ['ratings'])])])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['performed']), Tree('ADVP', [Tree('RB', ['back'])])])])]) 

Then we traverse the tree and check for VPs, as you did:

>>> VPs = list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP')) 

Afterwards, we simply use the leaves to get the VPs:

>>> for vp in VPs: 
...  print " ".join(vp.leaves()) 
... 
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings 
performed back 

And so, to get the VP strings:

>>> VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))] 
>>> VPs_str 
[u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings', u'performed back'] 

Alternatively, I personally like to use a chunker, rather than a full-blown parser, to extract phrases.

Using the nltk_cli tool (https://github.com/alvations/nltk_cli):

~/git/nltk_cli$ echo "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" > input-doneyo.txt 
~/git/nltk_cli$ python senna.py --chunk VP input-doneyo.txt 
calibrated|to benchmark|performed 
~/git/nltk_cli$ python senna.py --vp input-doneyo.txt 
calibrated|to benchmark|performed 
~/git/nltk_cli$ python senna.py --chunk2 VP+NP input-doneyo.txt 
calibrated the low defaults portfolio|to benchmark ratings 

The VP chunk outputs are separated by |, i.e.

Output:

calibrated|to benchmark|performed 

means:

  • calibrated
  • to benchmark
  • performed

And the VP+NP chunk outputs are also separated by |, with the VP and the NP within each chunk separated by \t, i.e.

Output:

calibrated the low defaults portfolio|to benchmark ratings 

means (VP + NP):

  • calibrated + the low defaults portfolio
  • to benchmark + ratings
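
If you want these pairs in Python rather than on the command line, here is a minimal sketch of mine (the input line is hypothetical and assumes the '|' and tab separators described above):

chunks = "calibrated\tthe low defaults portfolio|to benchmark\tratings" 
# Chunks are separated by '|'; the VP and NP inside a chunk by a tab. 
vp_np_pairs = [tuple(chunk.split('\t', 1)) for chunk in chunks.split('|')] 
# -> [('calibrated', 'the low defaults portfolio'), ('to benchmark', 'ratings')] 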

Unrelated to NLTK and the StanfordParser, one way to get normally readable text back is to "detokenize" the output using the Moses SMT scripts (https://github.com/moses-smt/mosesdecoder), e.g.:

~$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl 
--2016-02-13 21:27:12-- https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl 
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.43.133 
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.43.133|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 12473 (12K) [text/plain] 
Saving to: ‘detokenizer.perl’ 

100%[===============================================================================================================================>] 12,473  --.-K/s in 0s  

2016-02-13 21:27:12 (150 MB/s) - ‘detokenizer.perl’ saved [12473/12473] 

~$ echo "constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings" | perl detokenizer.perl 2> /tmp/null 
constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings 

Note that the output MIGHT not be identical to the input, but for English, most of the time it will be converted into the normal text that we read/write.
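
If you would rather call the detokenizer from Python than from the shell, here is a minimal subprocess sketch of mine (it assumes detokenizer.perl sits in the current directory; '-l en' selects English):

import subprocess 

def moses_detokenize(tokenized): 
    # Pipe the tokenized string through the Moses detokenizer script. 
    p = subprocess.Popen(['perl', 'detokenizer.perl', '-l', 'en'], 
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE) 
    out, _ = p.communicate(tokenized.encode('utf-8')) 
    return out.decode('utf-8').strip() 

print(moses_detokenize(u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings')) 

This should print the string with normal spacing around the comma, as in the shell example above.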

A detokenizer for NLTK is in the pipeline, but it will take a while for us to code it, test it, and push it to the repository; we ask for your patience (see https://github.com/nltk/nltk/issues/1214).