Python regular expression nltk website extraction

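Hi, I have never had to deal with regular expressions before, and I'm trying to preprocess some raw text with Python and NLTK. When I tried to tokenize the document using: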

sentence_re = r'''(?x) # set flag to allow verbose regexps 
    ([A-Z])(\.[A-Z])+\.? # abbreviations, e.g. U.S.A. 
| \w+(-\w+)*   # words with optional internal hyphens 
| \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82% 
| \#?\w+|\@?\w+   # hashtags and @ signs 
| \.\.\.    # ellipsis 
| [][.,;"'?()-_`]  # these are separate tokens 
| (?:http://|www.)[^"\' ]+ # websites 
'''
tokens = nltk.regexp_tokenize(corpus, sentence_re)

it fails to keep each website as a single token:

print(tokens[:50])
['on', '#Seamonkey', '(', 'SM', ')', '-', 'I', 'had', 'a', 'short', 'chirp', 'exchange', 'with', '@angie1234p', 'at', 'the', '18thDec', ';', 'btw', 'SM', 'is', 'faster', 'has', 'also', 'an', 'agile', '...', '1', '/', '2', "'", '...', 'user', 'community', '-', 'http', ':', '/', '/', 'bit', '.', 'ly', '/', 'XnF5', '+', 'ICR', 'http', ':', '/', '/']

Any help is greatly appreciated. Thanks a lot!

-Florie

Source

2011-10-06 Florie

Natural language parsing is a good place to start learning regular expressions.

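In this tokenizer, regular expressions are used to specify what the tokens you want to extract from the text can look like. I'm a bit confused about which of the many regular expressions above you actually used, but for a very simple tokenization into non-whitespace tokens you could use: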

>>> corpus = "this is a sentence. and another sentence. my homepage is http://test.com" 
>>> nltk.regexp_tokenize(corpus, r"\S+") 
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

which is equivalent to:

>>> corpus.split() 
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

Another approach is to use the NLTK functions nltk.sent_tokenize() and nltk.word_tokenize():

>>> sentences = nltk.sent_tokenize(corpus) 
>>> sentences 
['this is a sentence.', 'and another sentence.', 'my homepage is http://test.com'] 
>>> for sentence in sentences: 
...     print(nltk.word_tokenize(sentence))
...
['this', 'is', 'a', 'sentence', '.'] 
['and', 'another', 'sentence', '.'] 
['my', 'homepage', 'is', 'http', ':', '//test.com']

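However, if your text contains a lot of website URLs, this might not be the best choice. Information about the different tokenizers in NLTK can be found here.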
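One possible way to keep URLs intact while still splitting off punctuation is to put a URL alternative first in the pattern. Regex alternatives are tried left to right, so a generic word branch listed earlier would consume 'http' before the URL branch gets a chance. The sketch below uses the standard-library `re` module rather than NLTK. The names `TOKEN_RE`, `tokenize` and `sample` are made up for illustration, and the pattern is a simplified assumption, not the pattern from the question.

```python
import re

# Verbose pattern. Alternatives are tried left to right, so the URL
# branch is listed before the generic word branch.
TOKEN_RE = re.compile(r'''
      (?:https?://|www\.)[^\s"']+     # URLs, kept as one token
    | [\#@]?\w+(?:-\w+)*              # words, hashtags, @mentions
    | \.\.\.                          # ellipsis
    | [^\w\s]                         # any other single punctuation mark
''', re.VERBOSE)

def tokenize(text):
    # Only non-capturing groups are used, so findall returns whole matches.
    return TOKEN_RE.findall(text)

sample = "user community - http://bit.ly/XnF5 ... thanks @angie1234p #Seamonkey (SM)"
print(tokenize(sample))
```

A known limitation of this sketch: punctuation directly after a URL (for example a full stop at the end of a sentence) is swallowed into the URL token, because the URL branch accepts any character that is not whitespace or a quote.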

If you just want to extract the URLs from the corpus, you could use a regular expression like this:

# non-capturing group, so the whole URL is returned as the token
nltk.regexp_tokenize(corpus, r'(?:https?://|www\.)[^"\' ]+')
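One detail worth knowing when trying a URL pattern like this with plain `re.findall`: if the pattern contains a capturing group, `findall` returns the text of the group rather than the whole match. The sketch below uses only the standard-library `re` module; the variable names `captured` and `urls` are illustrative.

```python
import re

corpus = "this is a sentence. and another sentence. my homepage is http://test.com"

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'(http://|https://|www\.)[^"\' ]+', corpus)

# Non-capturing group (?:...): findall returns the whole match.
urls = re.findall(r'(?:https?://|www\.)[^"\' ]+', corpus)

print(captured)  # ['http://']
print(urls)      # ['http://test.com']
```

Writing the prefix as a non-capturing group avoids this difference, and escaping the dot in `www\.` keeps it from matching an arbitrary character.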

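Hope this helps. If this is not the answer you were looking for, please try to explain more precisely what you want to do and what you want your tokens to look like (e.g. an example of the input/output you would like to have), and we can help find the right regular expression.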

Source

2011-10-06 21:25:14 tobigue
