2017-10-16

Python - regex tokenizer with conditions

I am trying to tokenize sentences in Python while removing punctuation, but there are a few "conditions" under which I want the tokenizer to leave the punctuation alone — for example when it is part of a URL or an email address, or when a symbol sits between words with no surrounding spaces. For example:

from nltk.tokenize import RegexpTokenizer 
tokenizer = RegexpTokenizer(r"[\w']+") 

tokenizer.tokenize("please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode") 

Right now the output looks like this:

['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google', 'com', 'or', 'google', 'co', 'uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname', 'shecode']

But I really want it to look like this:

['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google.com', 'or', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']
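One way to get close to that output (a sketch of my own, not from any answer here): put the higher-priority patterns before the generic word pattern in a single alternation, so dotted URLs and key=value pairs are matched whole before the fallback fires. NLTK's RegexpTokenizer is essentially a wrapper around re.findall, so the same pattern works with either.

```python
import re

# Sketch: alternatives are tried left to right, so list specific
# patterns (URLs, key=value) before the generic word pattern.
pattern = r"""
    \w+(?:\.\w+)+        # dotted tokens such as google.co.uk
  | \w+=\w+              # key=value pairs such as myname=shecode
  | [\w']+               # fallback: plain words, keeping apostrophes
"""
tokens = re.findall(pattern, "don't split google.co.uk or myname=shecode",
                    re.VERBOSE)
print(tokens)  # ["don't", 'split', 'google.co.uk', 'or', 'myname=shecode']
```

Note the dotted-token alternative would also keep abbreviations like "i.e" together, so you may need to tune it to your data.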


Try using `from nltk.tokenize import word_tokenize`. I'm not sure whether it will serve your purpose, but give it a try. Thanks. – Gunjan


You should a) pre-tokenize the input on whitespace; b) check whether each piece is a URL; and c) handle URL and non-URL tokens differently. – alexis
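The a/b/c recipe in this comment could be sketched like so (the function name and the crude URL test are my own, not from the comment):

```python
import re
import string

# Crude "looks like a URL" check: word characters joined by dots.
URL_RE = re.compile(r"\w+(?:\.\w+)+")

def tokenize_with_urls(text):
    """Sketch: whitespace pre-tokenization, then per-piece handling."""
    tokens = []
    for piece in text.split():                      # (a) split on whitespace
        if URL_RE.match(piece):                     # (b) URL-ish piece?
            tokens.append(piece.rstrip("."))        # (c) keep it whole, minus a
        else:                                       #     sentence-final dot
            word = piece.strip(string.punctuation)  # trim only edge punctuation,
            if word:                                # so an interior '=' survives
                tokens.append(word)
    return tokens

print(tokenize_with_urls("see google.co.uk. or , myname=shecode today"))
# ['see', 'google.co.uk', 'or', 'myname=shecode', 'today']
```

The URL test is deliberately naive — it also matches abbreviations like "i.e." — so a real implementation would want a stricter pattern.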

Answers


Change your regular expression to the following:

tokenizer = RegexpTokenizer(r"[\w+.]+") 

In regular expressions, `.` means any character.

So your code was also splitting on `.`; the new regular expression prevents the split on `.`.
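To illustrate what the proposed character class does (a quick demo of my own, shown with the stdlib `re` module, which RegexpTokenizer wraps): inside `[...]`, `+` and `.` are literal characters, so dotted runs stay together.

```python
import re

# '[\w+.]+' keeps dots inside tokens, so google.co.uk survives;
# '=' is not in the class, so myname=shecode still splits.
print(re.findall(r"[\w+.]+", "google.co.uk or myname=shecode"))
# ['google.co.uk', 'or', 'myname', 'shecode']
```

Note that dropping the apostrophe from the class also means "don't" now splits into "don" and "t", and a sentence-final dot gets attached to the preceding token.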


Hi, sometimes I do want it to split on the dot, but conditionally. Perhaps if we see ".com" or ".co" then we don't want it to split — does that make sense? – shecode


In a regular expression, `.` means any character — except between the brackets `[` and `]`, where it is a literal dot. – Indent
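A two-line demo of this point (my own, not from the comment):

```python
import re

# Outside a class, '.' matches any character; inside [...] it is literal.
print(re.findall(r".", "a.b"))    # ['a', '.', 'b']
print(re.findall(r"[.]", "a.b"))  # ['.']
```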


Try this code and see whether it works for you.

from nltk.tokenize import word_tokenize 
punct_list = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'] 
s = "please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode" 
print([i.strip("".join(punct_list)) for i in word_tokenize(s) if i not in punct_list]) 

Check How to remove punctuation? as well.


You can use a more sophisticated regex tokenizer, e.g. the TreebankTokenizer behind nltk.word_tokenize; see How do I tokenize a string sentence in NLTK?

>>> from nltk import word_tokenize 
>>> text ="please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode" 
>>> word_tokenize(text) 
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode'] 

If you also want to remove stopwords, see Stopword removal with NLTK:

>>> from string import punctuation 
>>> from nltk.corpus import stopwords 
>>> from nltk import word_tokenize 

>>> stoplist = stopwords.words('english') + list(punctuation) 

>>> text ="please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode" 

>>> word_tokenize(text) 
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode'] 

>>> [token for token in word_tokenize(text) if token not in stoplist] 
['please', 'help', 'ignore', 'punctuation', 'like', 'time', "n't", 'ignore', 'looks', 'like', 'url', 'i.e', 'google.com', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'I', 'see', 'equals', 'sign', 'words', 'myname=shecode']