How to extract special characters along with POS-tagged words using NLTK RegexpParser chunking in Python

I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.

I need to extract $300,000 along with Each Human Resource/IT Department.

I tokenized the text and POS-tagged the words before chunking. I am able to extract 300,000, but I cannot extract the $ sign.

What I have so far:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# Tokenize and POS-tag each sentence (the sample contains only one)
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)

chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""

chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)

When converted to a list, the chunked output is: ['80 %', '300,000', 'Each Human Resource/IT Department']

What I want: ['80 %', '$300,000', 'Each Human Resource/IT Department']

I tried:

chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|}"""

It still doesn't work. So what I need is the $ together with the CD.

Asked 2016-07-06 by SVK

You need to add <\$>? to your grammar:

chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
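As far as I understand it, the backslash matters because NLTK turns tag patterns into regular expressions, where a bare $ means "end of string" rather than a literal dollar sign. The sketch below mimics that with plain `re` on a string of angle-bracketed tags. The string `tags` is a simplified stand-in for how the tagged sentence is represented internally, which is an assumption made for illustration.

```python
import re

# Simplified stand-in (assumption) for the tag sequence of "$ 300,000"
tags = "<$><CD>"

# Bare $ acts as an end-of-string anchor, so the optional group never matches
unescaped = re.search(r"(<$>)?<CD>", tags)

# Escaped \$ matches the literal "$" tag
escaped = re.search(r"(<\$>)?<CD>", tags)

print(unescaped.group(0))  # <CD>
print(escaped.group(0))    # <$><CD>
```

With the unescaped form the match silently falls back to the number alone, which is the same symptom as getting 300,000 without the $.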

Code:

import nltk 
from nltk.tokenize import PunktSentenceTokenizer 

text = '80% of $300,000 Each Human Resource/IT Department' 
train_text = text 
sample_text = text 
custom_sent_tokenizer = PunktSentenceTokenizer(train_text) 
tokenized = custom_sent_tokenizer.tokenize(sample_text) 

for i in tokenized: 
    words = nltk.word_tokenize(i) 
    tagged = nltk.pos_tag(words) 

chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}""" 

chunkParser = nltk.RegexpParser(chunkGram) 
chunked = chunkParser.parse(tagged) 

print(chunked)

Output:

(S 
    (chunk 80/CD %/NN) 
    of/IN 
    (chunk $/$ 300,000/CD) 
    (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
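To get from this tree to the list form asked for in the question, one option is to walk the subtrees labelled chunk and join their words. This is a minimal sketch, not part of the original answer: the tokens are tagged by hand to mirror the output above (so no tagger model is needed), and the helper `chunk_to_text` is a hypothetical name that glues the $ to the number that follows it.

```python
import nltk

# Hand-tagged tokens mirroring the output above (assumption: same tags as nltk.pos_tag gave)
tagged = [
    ("80", "CD"), ("%", "NN"), ("of", "IN"),
    ("$", "$"), ("300,000", "CD"),
    ("Each", "DT"), ("Human", "NNP"), ("Resource/IT", "NNP"), ("Department", "NNP"),
]

chunk_parser = nltk.RegexpParser(
    r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
)
tree = chunk_parser.parse(tagged)


def chunk_to_text(subtree):
    """Join a chunk's words with spaces, attaching '$' to the following number."""
    words = " ".join(word for word, _tag in subtree.leaves())
    return words.replace("$ ", "$")


# Keep only the subtrees produced by the "chunk" rule, skipping the root S node
phrases = [chunk_to_text(st) for st in tree.subtrees() if st.label() == "chunk"]
print(phrases)
```

For the sample sentence this should give the three phrases from the question, with the dollar sign attached to 300,000.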

Answered 2016-07-10 13:26:16 by RAVI

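I tried it, but it does not bring the $ along with the number. Thanks – SVK

You did not try the same chunkGram. There is a difference in the chunkGram I wrote. Try copying and pasting this code and testing it on your system. It will give the '$'. – RAVI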
