NLTK performance

2011-12-27

9

OK, I've recently become very interested in natural language processing; however, most of my work so far has been in C. I've heard of NLTK, and while I don't know Python, it looks easy to learn, and it seems like a very powerful and interesting language. In particular, the NLTK module looks very well suited to what I need to do.

However, when I take the sample code for NLTK and paste it into a file named test.py, I notice that it takes a very, very long time to run!

I call it from the shell like so:

time python ./test.py 

and on a 2.4 GHz machine with 4 GB of RAM, it takes 19.187 seconds!

Now, maybe this is absolutely normal, but I was under the impression that NLTK was extremely fast; I may be mistaken, but is there something obvious that I'm clearly doing wrong here?

+3

Where did you get the impression that NLTK is extremely fast? – 2011-12-27 15:24:47

+0

From the Amazon description of 'Python Text Processing with NLTK 2.0': "Learn how to easily deal with huge amounts of data without any loss of efficiency or speed." (http://www.amazon.com/Python-Text-Processing-NLTK-Cookbook/dp/1849513600). – elliottbolzan 2011-12-27 16:17:21

Answers

19

I believe you are conflating training time with processing time. Training a model like the UnigramTagger can take a long time, and so can loading a pickled trained model from disk. But once the model is in memory, processing can be very fast. See the "Classifier Efficiency" section at the bottom of my post on part of speech tagging with NLTK to get an idea of the processing speed of the different tagging algorithms.
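A minimal sketch of that train-once, pickle, load-later pattern. It uses a tiny hand-made tagged corpus as a stand-in for `nltk.corpus.brown.tagged_sents()` (the part that makes the original script slow) so it runs in milliseconds; the filename `tagger.pickle` is just an example:

```python
import pickle
import nltk

# Train a UnigramTagger on a tiny hand-made tagged corpus (a stand-in
# for the full Brown corpus used in the question's sample code).
train_sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
tagger = nltk.UnigramTagger(train_sents)

# Pay the training cost once and save the trained model...
with open("tagger.pickle", "wb") as f:
    pickle.dump(tagger, f)

# ...so that later runs only pay the (much cheaper) load cost.
with open("tagger.pickle", "rb") as f:
    tagger = pickle.load(f)

print(tagger.tag(["the", "dog", "barks"]))
```

Words the tagger never saw during training come back tagged `None` (unless you give the `UnigramTagger` a backoff tagger), which is one reason training on the full Brown corpus is worth the one-time cost.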

7

@Jacob is right about conflating training and tagging time. I've simplified the sample code a bit, and here's the time breakdown:

Importing nltk takes 0.33 secs 
Training time: 11.54 secs 
Tagging time: 0.0 secs 
Sorting time: 0.0 secs 

Total time: 11.88 secs 

System:

CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz 
Memory: 3.7GB 

Code:

import pprint, time 
startstart = time.clock() 

start = time.clock() 
import nltk 
print "Importing nltk takes", str((time.clock()-start)),"secs" 

start = time.clock() 
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+') 
tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents()) 
print "Training time:",str((time.clock()-start)),"secs" 


text = """Mr Blobby is a fictional character who featured on Noel 
Edmonds' Saturday night entertainment show Noel's House Party, 
which was often a ratings winner in the 1990s. Mr Blobby also 
appeared on the Jamie Rose show of 1997. He was designed as an 
outrageously over the top parody of a one-dimensional, mute novelty 
character, which ironically made him distinctive, absurd and popular. 
He was a large pink humanoid, covered with yellow spots, sporting a 
permanent toothy grin and jiggling eyes. He communicated by saying 
the word "blobby" in an electronically-altered voice, expressing 
his moods through tone of voice and repetition. 

There was a Mrs. Blobby, seen briefly in the video, and sold as a 
doll. 

However Mr Blobby actually started out as part of the 'Gotcha' 
feature during the show's second series (originally called 'Gotcha 
Oscars' until the threat of legal action from the Academy of Motion 
Picture Arts and Sciences[citation needed]), in which celebrities 
were caught out in a Candid Camera style prank. Celebrities such as 
dancer Wayne Sleep and rugby union player Will Carling would be 
enticed to take part in a fictitious children's programme based around 
their profession. Mr Blobby would clumsily take part in the activity, 
knocking over the set, causing mayhem and saying "blobby blobby 
blobby", until finally when the prank was revealed, the Blobby 
costume would be opened - revealing Noel inside. This was all the more 
surprising for the "victim" as during rehearsals Blobby would be 
played by an actor wearing only the arms and legs of the costume and 
speaking in a normal manner.[citation needed]""" 

start = time.clock() 
tokenized = tokenizer.tokenize(text) 
tagged = tagger.tag(tokenized) 
print "Tagging time:",str((time.clock()-start)),"secs" 

start = time.clock() 
tagged.sort(lambda x,y:cmp(x[1],y[1])) 
print "Sorting time:",str((time.clock()-start)),"secs" 

#l = list(set(tagged)) 
#pprint.pprint(l) 
print 
print "Total time:",str((time.clock()-startstart)),"secs" 
+1

Great to have hard numbers *and* code to reproduce them! – Titou 2016-10-26 15:19:08

0

I used NLTK with the following modified version of this code: https://github.com/ugik/notebooks/blob/master/Neural_Network_Classifier.ipynb

It runs well, but I noticed that the machine I use to launch this code seems to have no effect on performance. I simplified the code down to the "train" function definition and applied it to training on a single-sentence corpus, then launched it on different computers:

TEST 1

Linux 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Processor: 16x Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz

MemTotal: 125827556 kB

Importing nltk and other modules takes 0.9350419999999999 secs

Training with 20 neurons, alpha: 0.1, iterations: 10000, dropout: False

Training time: 1.1798350000000006 secs

TEST 2

Linux 4.8.0-41-generic #44~16.04.1-Ubuntu SMP Fri Mar 3 17:11:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Processor: 4x Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz

MemTotal: 16289540 kB

Importing nltk and other modules takes 0.397839 secs

Training with 20 neurons, alpha: 0.1, iterations: 10000, dropout: False

Training time: 0.7186329999999996 secs

How can the training time be longer on a 16-Xeon-core/122 GB RAM Amazon machine than on my i5/16 GB computer?
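One likely explanation: a training run that finishes in about a second is dominated by fixed overhead and single-core speed, not core count or RAM, so a 3.80GHz i5 can plausibly beat a 2.30GHz Xeon. A minimal sketch of a fairer micro-benchmark, using `timeit` on a hypothetical stand-in workload (not the actual `train` function from the notebook):

```python
import timeit

# Hypothetical stand-in for the single-sentence training run: a tiny
# CPU-bound workload whose one-shot wall time is mostly noise and overhead.
def tiny_workload():
    total = 0.0
    for i in range(10000):
        total += i * 0.1
    return total

# Repeat the workload and take the best of several runs, which damps
# the variance that a single measurement (one `time python ...`) hides.
best = min(timeit.repeat(tiny_workload, number=100, repeat=5))
print("best of 5 x 100 runs: %.4f secs" % best)
```

Comparing the best-of-several repeated timings between the two machines would tell you far more than one sub-second run on each.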