Spacy NLP - chunking with regular expressions

spaCy provides a noun_chunks iterator to retrieve a document's noun phrases. The english_noun_chunks function (attached below) uses word.pos == NOUN:

from spacy.symbols import NOUN

def english_noun_chunks(doc):
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    # Convert dependency label strings to their integer IDs
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for i in range(len(doc)):
        word = doc[i]
        if word.pos == NOUN and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label
        elif word.pos == NOUN and word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
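
For context, here is a minimal sketch of the built-in iterator in use (assuming a 2016-era spaCy 1.x install with the English model; the sample sentence is hypothetical):

import spacy

nlp = spacy.load('en')  # spaCy 1.x English model, assumed installed
doc = nlp(u'The quick brown fox jumped over the lazy dog.')
# doc.noun_chunks is driven by the iterator shown above
for chunk in doc.noun_chunks:
    print(chunk.text)
# Roughly expected output: 'The quick brown fox' and 'the lazy dog'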

I would like to get chunks from a sentence that match a certain regular expression. For example, zero or more adjectives followed by one or more nouns:

{(<JJ>)*(<NN | NNS | NNP>)+} 

Is this possible without rewriting the english_noun_chunks function?

Answer


You could rewrite this function without losing any performance, since it is implemented in pure Python, but why not simply filter the chunks after retrieving them?

import re
import spacy

def filtered_chunks(doc, pattern):
    for chunk in doc.noun_chunks:
        # Build a tag signature like '<JJ><NN>' from the chunk's tokens
        signature = ''.join(['<%s>' % w.tag_ for w in chunk])
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load('en')
doc = nlp(u'Great work!')
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

print(list(filtered_chunks(doc, pattern)))
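
As a follow-up sketch under the same assumptions (the sentence is hypothetical), printing each chunk's signature shows what the regex sees; note that pattern.match is anchored at the start of the signature, so a chunk beginning with a determiner will fail to match:

doc = nlp(u'The quick brown fox jumped over the lazy dog.')
for chunk in doc.noun_chunks:
    signature = ''.join('<%s>' % w.tag_ for w in chunk)
    print(chunk.text, signature, bool(pattern.match(signature)))
# 'The quick brown fox' yields '<DT><JJ><JJ><NN>', which does not match
# because of the leading <DT>; allow an optional '(<DT>)?' prefix in the
# pattern (or use pattern.search) if determiners should be accepted.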

What about the fact that this function is translated to C by Cython? – Serendipity


You are right, the file has a '.pyx' extension, so you would lose some performance by rewriting it. But do you really need to rewrite it, or can you simply filter the final result? –
