Spacy NLP - chunking with regular expressions

spaCy provides a noun_chunks iterator to retrieve a document's noun phrases. The english_noun_chunks function (attached below) uses word.pos == NOUN:

from spacy.symbols import NOUN

def english_noun_chunks(doc):
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    # Convert dependency label strings to their integer IDs
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for i in range(len(doc)):
        word = doc[i]
        if word.pos == NOUN and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label
        elif word.pos == NOUN and word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
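
For context, here is a minimal sketch of the built-in iterator in use (assuming a 2016-era spaCy 1.x install with the English model; the sample sentence is hypothetical):

import spacy

nlp = spacy.load('en')  # spaCy 1.x English model, assumed installed
doc = nlp(u'The quick brown fox jumped over the lazy dog.')
# doc.noun_chunks is driven by the iterator shown above
for chunk in doc.noun_chunks:
    print(chunk.text)
# Roughly expected output: 'The quick brown fox' and 'the lazy dog'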

I would like to get chunks from a sentence that match a certain regular expression. For example, zero or more adjectives followed by one or more nouns:

{(<JJ>)*(<NN | NNS | NNP>)+} 

Is this possible without rewriting the english_noun_chunks function?

Answer


You could rewrite this function without losing any performance, since it is implemented in pure Python, but why not simply filter the chunks after retrieving them?

import re
import spacy

def filtered_chunks(doc, pattern):
    for chunk in doc.noun_chunks:
        # Build a tag signature like '<JJ><NN>' from the chunk's tokens
        signature = ''.join(['<%s>' % w.tag_ for w in chunk])
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load('en')
doc = nlp(u'Great work!')
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

print(list(filtered_chunks(doc, pattern)))
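
As a follow-up sketch under the same assumptions (the sentence is hypothetical), printing each chunk's signature shows what the regex sees; note that pattern.match is anchored at the start of the signature, so a chunk beginning with a determiner will fail to match:

doc = nlp(u'The quick brown fox jumped over the lazy dog.')
for chunk in doc.noun_chunks:
    signature = ''.join('<%s>' % w.tag_ for w in chunk)
    print(chunk.text, signature, bool(pattern.match(signature)))
# 'The quick brown fox' yields '<DT><JJ><JJ><NN>', which does not match
# because of the leading <DT>; allow an optional '(<DT>)?' prefix in the
# pattern (or use pattern.search) if determiners should be accepted.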

What about the fact that this function is translated to C by Cython? – Serendipity


You are right, the file has a '.pyx' extension, so you would lose some performance by rewriting it. But do you really need to rewrite it, or can you simply filter the final result? –
