Python re.split() vs nltk word_tokenize and sent_tokenize
I was going through this question.
I am just wondering whether NLTK would be faster than regex in word/sentence tokenization.
The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.
Note that str.split() does not produce tokens in the linguistic sense, e.g.:
>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
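str.split() is usually used to separate strings on a specified delimiter: in a tab-delimited file you can use str.split('\t'), and when your text file has one sentence per line you can split the string on the newline \n.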
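As a small illustration of the delimiter-splitting use case, here is a minimal sketch. The sample strings are made up for the example and do not come from the benchmark data.

```python
# Hypothetical sample data: one tab-separated record and a small
# text where each line holds exactly one sentence.
tsv_row = "1\tThis is a foo, bar sentence.\ten"
one_sentence_per_line = "This is the first sentence.\nThis is the second one."

# Split a record into its fields on the tab character.
fields = tsv_row.split('\t')
print(fields)

# Split a text into sentences when newlines already mark the boundaries.
sentences = one_sentence_per_line.split('\n')
print(sentences)
```

In both cases the delimiter carries the structure, so no linguistic knowledge is needed.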
And let's do some benchmarking in python3:
import time
from nltk import word_tokenize
import urllib.request

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Time 10 passes of plain whitespace splitting over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

# Time 10 passes of NLTK's default word tokenizer over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)
[out]:
str.split(): 0.05451083183288574
str.split(): 0.054320573806762695
str.split(): 0.05368804931640625
str.split(): 0.05416440963745117
str.split(): 0.05299568176269531
str.split(): 0.05304527282714844
str.split(): 0.05356955528259277
str.split(): 0.05473494529724121
str.split(): 0.053118228912353516
str.split(): 0.05236077308654785
word_tokenize(): 4.056122779846191
word_tokenize(): 4.052812337875366
word_tokenize(): 4.042144775390625
word_tokenize(): 4.101543664932251
word_tokenize(): 4.213029146194458
word_tokenize(): 4.411528587341309
word_tokenize(): 4.162556886672974
word_tokenize(): 4.225975036621094
word_tokenize(): 4.22914719581604
word_tokenize(): 4.203172445297241
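Since the question is about regex speed, the same kind of measurement can be sketched with the standard timeit module and a regex tokenizer. This is only an illustration under assumptions: the corpus is a made-up repeated sentence rather than the downloaded file, and the pattern is a hypothetical one, not what NLTK uses.

```python
import re
import timeit

# Made-up corpus standing in for the downloaded test file.
data = "This is a foo, bar sentence.\n" * 1000

# Hypothetical regex tokenizer: runs of word characters, or any single
# character that is neither a word character nor whitespace
# (so each punctuation mark becomes its own token).
token_pattern = re.compile(r"\w+|[^\w\s]")

def split_whitespace(text):
    return [line.split() for line in text.split('\n')]

def split_regex(text):
    return [token_pattern.findall(line) for line in text.split('\n')]

for func in (split_whitespace, split_regex):
    # 3 repeats of 10 calls each; report the fastest repeat.
    best = min(timeit.repeat(lambda: func(data), number=10, repeat=3))
    print('{}:\t{:.4f}s'.format(func.__name__, best))
```

Taking the minimum over several repeats reduces noise from other processes, which a single time.time() difference does not.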
If we try another tokenizer in bleeding-edge NLTK, from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:
import time
from nltk.tokenize import ToktokTokenizer
import urllib.request

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Bind the method once so the loop does not re-create the tokenizer.
toktok = ToktokTokenizer().tokenize

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)
[out]:
toktok: 1.5902607440948486
toktok: 1.5347232818603516
toktok: 1.4993178844451904
toktok: 1.5635688304901123
toktok: 1.5779635906219482
toktok: 1.8177132606506348
toktok: 1.4538452625274658
toktok: 1.5094449520111084
toktok: 1.4871931076049805
toktok: 1.4584410190582275
(Note: the source of the text file is https://github.com/Simdiva/DSL-Task)
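If we look at the native perl implementation, the python vs perl timings for ToktokTokenizer are comparable. Note, though, that in the Python implementation the regexes are pre-compiled while in Perl they are not; still, the proof is in the pudding: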
$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36-- https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’
100%[===============================================================================================================================>] 2,690 --.-K/s in 0s
2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]
$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38-- https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’
100%[===============================================================================================================================>] 3,483,550 363KB/s in 7.4s
2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]
$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.703s
user 0m1.693s
sys 0m0.008s
$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.715s
user 0m1.704s
sys 0m0.008s
$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.700s
user 0m1.686s
sys 0m0.012s
$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.727s
user 0m1.700s
sys 0m0.024s
$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.734s
user 0m1.724s
sys 0m0.008s
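(Note: when timing tok-tok.pl, we had to redirect the output to a file, so the timings here include the time the machine takes to write to the file, whereas the nltk.tokenize.ToktokTokenizer timings do not include any time spent writing output to a file)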
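The point about pre-compiled regexes can be illustrated with the standard re module. This is only a sketch: the pattern below is made up for the example and is not one of the actual ToktokTokenizer rules.

```python
import re

# Hypothetical rule, not taken from ToktokTokenizer: put spaces around
# commas, periods, question marks and exclamation marks.
PUNCT = re.compile(r'([,.?!])')

def pad_punct_precompiled(line):
    # The pattern object is built once, when the module is loaded.
    return PUNCT.sub(r' \1 ', line).split()

def pad_punct_inline(line):
    # The pattern string is handed to re.sub() on every call; the re
    # module keeps an internal cache of compiled patterns, so this
    # mostly costs a cache lookup rather than a full recompilation.
    return re.sub(r'([,.?!])', r' \1 ', line).split()

print(pad_punct_precompiled("This is a foo, bar sentence."))
print(pad_punct_inline("This is a foo, bar sentence."))
```

Both functions return the same tokens; any speed difference comes only from how the pattern object is obtained.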
As for sent_tokenize(), it is a little different, and comparing speed benchmarks without considering accuracy is a little quirky.
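Consider this:
If a regex treats a whole text file/paragraph as 1 sentence, then the speed is almost instantaneous, i.e. zero work done. But that would be a horrible sentence tokenizer...
If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') vs re.split('\n'), and nltk would have nothing to do with the sentence tokenization :P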
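The second case can be sketched as follows; the three-line text is made up for the example.

```python
import re

# Hypothetical text whose sentences are already separated by newlines.
text = "First sentence.\nSecond sentence.\nThird sentence."

by_str = text.split('\n')      # plain string method
by_re = re.split('\n', text)   # regex engine, same delimiter

# Both approaches produce the same list of sentences.
print(by_str == by_re)
```

Because the two calls return identical results here, any comparison between them is purely about speed, not about tokenization quality.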
For information on how sent_tokenize() works in NLTK, see:
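So to effectively compare sent_tokenize() vs other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy, and have a dataset of human-evaluated sentences in a tokenized format.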
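One possible way to score accuracy is exact-match precision and recall against the human-segmented sentences. The helper below is a hypothetical sketch, not part of NLTK, and exact matching is just one of several reasonable scoring choices.

```python
def precision_recall(predicted, gold):
    """Score predicted sentences against gold sentences by exact match."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall

# Gold segmentation vs. a tokenizer that failed to split after "No!".
gold = ["No!", "We are very far from that."]
predicted = ["No! We are very far from that."]
print(precision_recall(predicted, gold))
```

A fast tokenizer that scores poorly on such a measure is not a useful baseline, which is why speed alone says little.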
Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences
Given the text:
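In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.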
We want to get this:
In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.
So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you will get 0 positive results:
>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>>
>>> output = text.split('\n')
>>> sum(1 for sent in text.split('\n') if sent in answer)
0
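For contrast, here is a minimal sketch of a naive regex splitter on a short excerpt of the same text. The pattern is an assumption chosen for illustration (split on whitespace that follows ., ? or !), and it is not the method sent_tokenize() uses.

```python
import re

# Excerpt of the input text, including the missing space in "force?No!".
excerpt = ("What is to be done in these circumstances? To favor revolutions, "
           "overthrow everything, repel force by force?No! "
           "We are very far from that.")

# Expected sentences for this excerpt, one per line.
gold = """What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.""".split('\n')

# Naive rule: a sentence ends at ., ? or ! followed by whitespace.
candidates = re.split(r'(?<=[.?!])\s+', excerpt)

# Count candidates that exactly match an expected sentence.
hits = sum(1 for sent in candidates if sent in gold)
print(len(candidates), hits)
```

The rule finds some of the boundaries but cannot separate "force?No!", where no whitespace marks the break, so it recovers only part of the expected output.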
...and what keeps you from trying it? Run an example and time it with 'timeit'? – lenz
I am new to Python, especially nltk. I just noticed that when I switched from nltk, re.split() and s.split() are faster. I used to use this: sentences = sent_tokenize(txt), and now this: sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', txt) – wakamdr
Could it be that nltk is slower because it has to load wordnet at runtime? – wakamdr