
Python multiprocessing NLTK word_tokenizer - function never completes

I'm doing natural language processing with NLTK on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seems to be what I'm after, and when I run the test code below I can see all the cores being used, but the code never completes.

Running the same task without multiprocessing finishes in about a minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize 
import io 
import time 
import multiprocessing as mp 

def open_file(filepath): 
    #open and parse file 
    file = io.open(filepath, 'rU', encoding='utf-8') 
    text = file.read() 
    return text 

def mp_word_tokenize(text_to_process): 
    #word tokenize 
    start_time = time.clock() 
    pool = mp.Pool(processes=8) 
    word_tokens = pool.map(word_tokenize, text_to_process) 
    finish_time = time.clock() - start_time 
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens' 
    return word_tokens 

filepath = "./p40_compiled.txt" 
text = open_file(filepath) 
tokenized_text = mp_word_tokenize(text) 

OK, for anyone else suffering similar pain: the problem was with passing the text to nltk.word_tokenize via pool.map(), which iterates over the string character by character. That created an enormous iterable for word_tokenize to work through, so the computation just kept going. Solved by chunking the text into a list whose number of items matches the number of processes. *phew*
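
A minimal sketch of that chunking fix, assuming an 8-process pool and that splitting on whitespace boundaries is acceptable; the chunk_text helper is illustrative and not from the original post:

import io 
import multiprocessing as mp 
from nltk.tokenize import word_tokenize 

def chunk_text(text, n_chunks): 
    # illustrative helper: split on whitespace and regroup into roughly 
    # equal pieces so no word is cut in half across chunk boundaries 
    words = text.split() 
    step = len(words) // n_chunks + 1 
    return [' '.join(words[i:i + step]) for i in range(0, len(words), step)] 

if __name__ == '__main__': 
    with io.open('./p40_compiled.txt', encoding='utf-8') as f: 
        text = f.read() 
    n_procs = 8 
    chunks = chunk_text(text, n_procs)  # one chunk per process, not one character per task 
    pool = mp.Pool(processes=n_procs) 
    # each worker tokenizes a whole chunk; flatten the per-chunk token lists 
    word_tokens = [tok for chunk_tokens in pool.map(word_tokenize, chunks) for tok in chunk_tokens] 
    pool.close() 
    pool.join() 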

Answer


Here's a hacky way to do multi-threading using sframe:

>>> import sframe 
>>> import time 
>>> from nltk import word_tokenize 
>>> 
>>> import urllib.request 
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt' 
>>> response = urllib.request.urlopen(url) 
>>> data = response.read().decode('utf8') 
>>> 
>>> for _ in range(10): 
...  start = time.time() 
...  for line in data.split('\n'): 
...   x = word_tokenize(line) 
...  print ('word_tokenize():\t', time.time() - start) 
... 
word_tokenize():  4.058445692062378 
word_tokenize():  4.05820369720459 
word_tokenize():  4.090051174163818 
word_tokenize():  4.210559129714966 
word_tokenize():  4.17473030090332 
word_tokenize():  4.105806589126587 
word_tokenize():  4.082665681838989 
word_tokenize():  4.13646936416626 
word_tokenize():  4.185062408447266 
word_tokenize():  4.085020065307617 

>>> sf = sframe.SFrame(data.split('\n')) 
>>> for _ in range(10): 
...  start = time.time() 
...  x = list(sf.apply(lambda x: word_tokenize(x['X1']))) 
...  print ('word_tokenize() with sframe:\t', time.time() - start) 
... 
word_tokenize() with sframe:  7.174573659896851 
word_tokenize() with sframe:  5.072867393493652 
word_tokenize() with sframe:  5.129574775695801 
word_tokenize() with sframe:  5.10952091217041 
word_tokenize() with sframe:  5.015898942947388 
word_tokenize() with sframe:  5.037845611572266 
word_tokenize() with sframe:  5.015375852584839 
word_tokenize() with sframe:  5.016635894775391 
word_tokenize() with sframe:  5.155989170074463 
word_tokenize() with sframe:  5.132697105407715 

>>> for _ in range(10): 
...  start = time.time() 
...  x = [word_tokenize(line) for line in data.split('\n')] 
...  print ('str.split():\t', time.time() - start) 
... 
str.split():  4.176181793212891 
str.split():  4.116339921951294 
str.split():  4.1104896068573 
str.split():  4.140819549560547 
str.split():  4.103625774383545 
str.split():  4.125757694244385 
str.split():  4.10755729675293 
str.split():  4.177418947219849 
str.split():  4.11145281791687 
str.split():  4.140623092651367 

Note that the speed difference may be because I have something else running on the other cores. But with a larger dataset and dedicated cores, you would really see this scale.