2016-02-11 66 views
8

I was going through this question: Python re.split() vs nltk word_tokenize and sent_tokenize

I just want to know whether NLTK is faster than regular expressions for word/sentence tokenization.

+0

...and what keeps you from trying it? Run an example and time it with 'timeit'? – lenz

+0

I'm new to Python, esp. nltk. I just noticed that re.split() and s.split() were faster when I switched away from nltk. I used to use this: sentences = sent_tokenize(txt), and now this: sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', txt) – wakamdr

+0

Could nltk be slower because it has to load wordnet at runtime? – wakamdr

Answers

15

The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.

Note that str.split() does not produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence." 
>>> sent.split() 
['This', 'is', 'a', 'foo,', 'bar', 'sentence.'] 
>>> from nltk import word_tokenize 
>>> word_tokenize(sent) 
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.'] 

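It is usually used to split a string on a specified delimiter, e.g. in a tab-separated file you can use str.split('\t'), or when your text file has one sentence per line you can split the string on the newline \n.

And let's do some benchmarking in Python 3: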
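A minimal sketch of those two delimiter cases; the sample strings are made up for illustration:

```python
# Splitting on an explicit delimiter. Both input strings are made-up examples.
row = "foo\tbar\tbaz"                      # one record of a tab-separated file
fields = row.split('\t')

doc = "This is a foo.\nThat is a bar."     # one sentence per line
lines = doc.split('\n')

print(fields)
print(lines)
```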

import time 
from nltk import word_tokenize 

import urllib.request 
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt' 
response = urllib.request.urlopen(url) 
data = response.read().decode('utf8') 

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)

[out]:

str.split():  0.05451083183288574 
str.split():  0.054320573806762695 
str.split():  0.05368804931640625 
str.split():  0.05416440963745117 
str.split():  0.05299568176269531 
str.split():  0.05304527282714844 
str.split():  0.05356955528259277 
str.split():  0.05473494529724121 
str.split():  0.053118228912353516 
str.split():  0.05236077308654785 
word_tokenize():  4.056122779846191 
word_tokenize():  4.052812337875366 
word_tokenize():  4.042144775390625 
word_tokenize():  4.101543664932251 
word_tokenize():  4.213029146194458 
word_tokenize():  4.411528587341309 
word_tokenize():  4.162556886672974 
word_tokenize():  4.225975036621094 
word_tokenize():  4.22914719581604 
word_tokenize():  4.203172445297241 
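
The same kind of comparison can also be run with the standard timeit module, as suggested in the comments. The sketch below is standard-library only: the sample data is made up, and the regex is a crude stand-in tokenizer (an assumption for illustration, not NLTK's Treebank rules), so its numbers are not comparable to the ones above.

```python
import re
import timeit

# Made-up sample data; the benchmark above uses a downloaded corpus instead.
data = "This is a foo, bar sentence.\n" * 1000

def split_with_str():
    return [line.split() for line in data.split('\n')]

# Crude regex tokenizer (an assumption, not NLTK's rules): runs of word
# characters, or single punctuation marks.
token_re = re.compile(r"\w+|[^\w\s]")

def split_with_regex():
    return [token_re.findall(line) for line in data.split('\n')]

# Take the best of 3 repeats, each running the function 10 times.
str_time = min(timeit.repeat(split_with_str, number=10, repeat=3))
re_time = min(timeit.repeat(split_with_regex, number=10, repeat=3))
print('str.split():\t', str_time)
print('regex findall():\t', re_time)
```

Taking the minimum over several repeats reduces the effect of other processes on the measurement.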

If we try another tokenizer in the bleeding-edge NLTK, ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

import time 
from nltk.tokenize import ToktokTokenizer 

import urllib.request 
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt' 
response = urllib.request.urlopen(url) 
data = response.read().decode('utf8') 

toktok = ToktokTokenizer().tokenize 

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)

[OUT]:

toktok: 1.5902607440948486 
toktok: 1.5347232818603516 
toktok: 1.4993178844451904 
toktok: 1.5635688304901123 
toktok: 1.5779635906219482 
toktok: 1.8177132606506348 
toktok: 1.4538452625274658 
toktok: 1.5094449520111084 
toktok: 1.4871931076049805 
toktok: 1.4584410190582275 

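(Note: the source of the text file is https://github.com/Simdiva/DSL-Task)


If we look at the native Perl implementation, the Python vs Perl timings for ToktokTokenizer are comparable. But in the Python implementation the regexes are pre-compiled while in Perl they are not, so the proof is still in the pudding: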
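
To get a feel for what pre-compilation buys on the Python side, here is a small sketch. The pattern is an illustrative punctuation-padding rule, not one taken from tok-tok, and the input line is made up. Note that Python's re module caches compiled patterns internally, so the gap between the two variants is mostly the cache lookup.

```python
import re
import timeit

line = "This is a foo, bar sentence."   # made-up input
pattern = r"([,.?!])"                   # illustrative rule, not taken from tok-tok

compiled = re.compile(pattern)

def with_precompiled():
    # Uses the pattern object compiled once above.
    return compiled.sub(r" \1 ", line)

def with_pattern_string():
    # re.sub() looks the pattern up in re's internal cache on every call.
    return re.sub(pattern, r" \1 ", line)

# Both variants must produce the same padded string.
assert with_precompiled() == with_pattern_string()
print('precompiled:\t', timeit.timeit(with_precompiled, number=100000))
print('pattern string:\t', timeit.timeit(with_pattern_string, number=100000))
```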

$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36-- https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl 
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133 
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 2690 (2.6K) [text/plain] 
Saving to: ‘tok-tok.pl’ 

100%[===============================================================================================================================>] 2,690  --.-K/s in 0s  

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690] 

$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38-- https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt 
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133 
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 3483550 (3.3M) [text/plain] 
Saving to: ‘test.txt’ 

100%[===============================================================================================================================>] 3,483,550 363KB/s in 7.4s 

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550] 

$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.703s 
user 0m1.693s 
sys 0m0.008s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.715s 
user 0m1.704s 
sys 0m0.008s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.700s 
user 0m1.686s 
sys 0m0.012s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.727s 
user 0m1.700s 
sys 0m0.024s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.734s 
user 0m1.724s 
sys 0m0.008s 

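(Note: when timing tok-tok.pl, we had to redirect the output to a file, so the timing here includes the time the machine takes to write to the file, whereas the nltk.tokenize.ToktokTokenizer timing does not include any time for writing output to a file.)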


As for sent_tokenize(), it is a little different: comparing speed benchmarks without considering accuracy is a little odd.

Consider this:

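  • If a regex splits a whole text/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. no work is done. But that would be a horrible sentence tokenizer...

  • If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') vs re.split('\n'), and nltk would have nothing to do with the sentence tokenization in that case :P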
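
A tiny sketch of the second case, with a made-up document that already has one sentence per line: both calls return the same list, so any difference between them is purely a matter of speed.

```python
import re

# Made-up document that already has one sentence per line.
doc = "First sentence.\nSecond sentence.\nThird sentence."

by_str = doc.split('\n')
by_re = re.split('\n', doc)

print(by_str == by_re)
```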

For information on how sent_tokenize() works in NLTK, see:

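So to effectively compare sent_tokenize() vs other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy, and have a dataset of human-evaluated sentences in a tokenized format.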
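
One rough way to put a number on accuracy is exact-match precision and recall against a list of gold sentences. The sketch below assumes that metric; the helper name sentence_scores, the two-sentence sample and the simple regex are illustrative choices, not part of NLTK or of any standard benchmark.

```python
import re

def sentence_scores(predicted, gold):
    """Exact-match precision and recall of predicted sentences against gold ones."""
    predicted_set = {s.strip() for s in predicted if s.strip()}
    gold_set = {s.strip() for s in gold if s.strip()}
    hits = len(predicted_set & gold_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall

# Made-up two-sentence sample with its hand-made gold segmentation.
gold = ["No!", "We are very far from that."]
text = "No! We are very far from that."

# The text contains no newline, so str.split('\n') returns it as one chunk.
print(sentence_scores(text.split('\n'), gold))

# Naive regex (an assumption): split on spaces that follow ., ? or !
print(sentence_scores(re.split(r'(?<=[.?!]) +', text), gold))
```

A real comparison would run each candidate splitter over a much larger gold-annotated dataset and report these scores next to the timings.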

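Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

We want to get this: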

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. 
Such were Willarski and even the Grand Master of the principal lodge. 
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. 
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge. 
Pierre began to feel dissatisfied with what he was doing. 
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. 
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. 
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order. 
What is to be done in these circumstances? 
To favor revolutions, overthrow everything, repel force by force? 
No! 
We are very far from that. 
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. 
"But what is there in running across it like that?" said Ilagin's groom. 
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. 

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you get 0 positive results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """ 
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. 
... Such were Willarski and even the Grand Master of the principal lodge. 
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. 
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge. 
... Pierre began to feel dissatisfied with what he was doing. 
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. 
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. 
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order. 
... What is to be done in these circumstances? 
... To favor revolutions, overthrow everything, repel force by force? 
... No! 
... We are very far from that. 
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. 
... "But what is there in running across it like that?" said Ilagin's groom. 
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.""" 
>>> 
>>> output = text.split('\n') 
>>> sum(1 for sent in text.split('\n') if sent in answer) 
0 
+0

Great answer. I like the inclusion of some simple benchmarks. – erewok

+0

I think the question was about sentence splitting, not word tokenization. – lenz

+0

lenz, it is still a very good answer – wakamdr