Matching PoS tags with specific text using `textacy.extract.pos_regex_matches(...)`

I use textacy's pos_regex_matches method to find certain chunks of text in sentences.

For example, assuming I have the text: Huey, Dewey, and Louie are triplet cartoon characters., I'd like to detect that Huey, Dewey, and Louie is an enumeration.

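To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):

import textacy

sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
# One or more proper nouns, optionally followed by separator(s) and more proper nouns
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:  # renamed from `list` to avoid shadowing the built-in
    print(match.text)

which prints: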

Huey, Dewey, and Louie

However, if I have something like the following:

sentence = 'Donald Duck - Disney'

then the - (dash) is recognised as <PUNCT> and the whole sentence is identified as a list, which it is not.

Is there a way to specify that only , and ; are valid <PUNCT> for lists?

I have looked for a reference on this regex language for matching PoS tags, with no luck. Can anybody help? Thanks in advance!

Source

2017-05-26 Stefano Bragaglia

Try using [,;] instead of PUNCT.

In short, it is not possible: see this official page.

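However, the merge request contains the code of the modified version described on that page, so the functionality can be rebuilt, although it performs worse than SpaCy's Matcher (see code and example; I have no idea how to reimplement my problem using a Matcher, though).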
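For what it's worth, a check on the token text itself needs neither the patched function nor a Matcher. Below is a minimal sketch in plain Python, not textacy's API: it assumes the tokens are already available as (text, POS) pairs, and the names `find_enumerations`, `is_separator` and `LIST_SEPARATORS` are hypothetical. The PoS tags in the sample data are assigned by hand for illustration; a real tagger may label tokens differently. Unlike the original pattern, it also assumes an enumeration needs at least two items.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (text, coarse PoS tag), e.g. ("Huey", "PROPN")

# Assumption: only these punctuation marks may separate list items.
LIST_SEPARATORS = {",", ";"}


def is_separator(token: Token) -> bool:
    """True for a conjunction, or for a PUNCT token whose text is , or ;."""
    text, pos = token
    return pos == "CCONJ" or (pos == "PUNCT" and text in LIST_SEPARATORS)


def find_enumerations(tokens: Sequence[Token]) -> List[List[str]]:
    """Find runs shaped like PROPN+ (SEP SEP? PROPN+)+ and return their texts."""
    results = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i][1] != "PROPN":
            i += 1
            continue
        start = i
        while i < n and tokens[i][1] == "PROPN":  # first item
            i += 1
        end, items = i, 1
        while True:
            j, seps = i, 0
            while j < n and seps < 2 and is_separator(tokens[j]):  # one or two separators
                j += 1
                seps += 1
            if seps == 0 or j >= n or tokens[j][1] != "PROPN":
                break  # no further item follows
            while j < n and tokens[j][1] == "PROPN":  # next item
                j += 1
            i = end = j
            items += 1
        if items >= 2:
            results.append([text for text, _ in tokens[start:end]])
    return results


# Hand-tagged sample data (illustrative only).
huey = [("Huey", "PROPN"), (",", "PUNCT"), ("Dewey", "PROPN"), (",", "PUNCT"),
        ("and", "CCONJ"), ("Louie", "PROPN"), ("are", "AUX"),
        ("triplet", "ADJ"), ("cartoon", "NOUN"), ("characters", "NOUN"),
        (".", "PUNCT")]
donald = [("Donald", "PROPN"), ("Duck", "PROPN"), ("-", "PUNCT"),
          ("Disney", "PROPN")]

print(find_enumerations(huey))
print(find_enumerations(donald))
```

With the hand-tagged data, the first sentence yields one enumeration and the second yields none, because the dash is PUNCT but not in `LIST_SEPARATORS`.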

If you want to go down the route of rebuilding the patched function anyway, you have to change the line:

words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))

to the following:

words.extend(keyword_map[w])

otherwise every symbol (like , and ; in my case) will be stripped away.
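To see why that line matters, here is a small stand-alone illustration using only the standard library. The `keyword_map` contents and the two helper names are made up for the example and are not taken from the merge request; only the two `words.extend(...)` calls mirror the lines quoted above.

```python
import re

# Hypothetical stand-in for the keyword map used by the patched code.
keyword_map = {"PUNCT": [",", ";", "-"], "PROPN": ["Huey", "Dewey"]}


def expand_stripped(keys):
    """Original line: every non-word character is removed from each entry."""
    words = []
    for w in keys:
        words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))
    return words


def expand_verbatim(keys):
    """Changed line: entries are kept exactly as they are."""
    words = []
    for w in keys:
        words.extend(keyword_map[w])
    return words


print(expand_stripped(["PUNCT", "PROPN"]))  # punctuation entries become empty strings
print(expand_verbatim(["PUNCT", "PROPN"]))  # punctuation entries survive
```

Since `\W` matches any non-word character, symbols such as , and ; collapse to empty strings with the original line, so they can no longer be told apart.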

Source

2017-05-26 10:40:08
