2016-03-01 43 views
0

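A faster way to store an NLTK FreqDist?

I'm trying to speed up my application, and I've found that the simple little function below (compute_ave_freq) is actually one of the biggest time hogs. The culprit seems to be unpickling the NLTK FreqDist, which takes a significant amount of time.

Of course, even that amount of time is less than half of what it takes to compute the FreqDist from scratch. Is there a better way to save an NLTK FreqDist object? I tried serializing it to JSON, but that saves it as a plain dictionary, which loses much of the NLTK functionality I need.

Here is the code: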

import pickle

def compute_ave_freq(word_forms):
    # The FreqDist is unpickled on every call (this is the slow step)
    with open("data/fd.txt", 'rb') as f:
        fd = pickle.load(f)
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq / len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq

And here is the LineProfiler output:

Total time: 0.197121 s 
File: /home/username/development/appname/filename.py 
Function: compute_ave_freq at line 25 
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def compute_ave_freq(word_forms, debug=False):
    26                                               # word_forms is a list of morphological variations of a word, such as
    27                                               # ['كتبوا', 'كتبو', 'كتبنا', 'كتبت']
    28
    29         1        78580  78580.0     79.1      fd = pickle.load(open("data/fd.txt", 'rb'))
    30         1            3      3.0      0.0      total_freq = 0
    31         5           10      2.0      0.0      for form in word_forms:
    32         4        20676   5169.0     20.8          freq = fd.freq(form)
    33         4            9      2.2      0.0          if debug==True:
    34                                                       print(form, '\n', freq)
    35         4            6      1.5      0.0          total_freq += freq
    36         1            1      1.0      0.0      try:
    37         1            3      3.0      0.0          ave_freq = total_freq/len(word_forms)
    38                                               except ZeroDivisionError:
    39                                                   ave_freq = 0
    40         1            1      1.0      0.0      return ave_freq

Thanks!

+0

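Unpickling loads the whole object into RAM, and that cost is fairly hard to avoid, but once it's loaded it's fine. Putting the data into a database (e.g. SQL/Mongo) would probably be a better approach for larger datasets. Otherwise, just wait a moment for it to load into RAM. – alvas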
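A minimal sketch of the database idea, assuming the counts are available as a plain word-to-count mapping (a FreqDist behaves like one). The table name `freq`, its columns, and the helper names `build_db` and `relative_freq` are hypothetical, chosen only for illustration; this is not something from the original thread.

```python
import sqlite3

def build_db(counts, path=":memory:"):
    """Store word counts in a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, count INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO freq VALUES (?, ?)", counts.items())
    conn.commit()
    return conn

def relative_freq(conn, word):
    """Return count(word) / total count, or 0.0 if the table is empty."""
    total = conn.execute("SELECT SUM(count) FROM freq").fetchone()[0]
    if not total:
        return 0.0
    row = conn.execute(
        "SELECT count FROM freq WHERE word = ?", (word,)
    ).fetchone()
    return (row[0] if row else 0) / total
```

With this layout only the rows that are actually queried get read, at the price of one query per lookup, so it is more likely to pay off when the distribution is too large to unpickle comfortably.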

+0

I think a general rule could be: "if your dataset fits entirely in RAM without much strain, then the time you would save by indexing/querying a database is probably not much." – alvas

+2

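Move `fd = pickle.load(open("data/fd.txt", 'rb'))` outside the function and, if `fd` changes between calls, pass it to the function, i.e. `def compute_ave_freq(word_forms, fd):`. Otherwise, if `fd` doesn't change, just make `fd` a global variable and load it once. – alvas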
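One way to combine both suggestions, sketched under assumptions: the helper `load_fd` and the optional `fd` parameter are hypothetical names, and `functools.lru_cache` is swapped in for the global variable so that the file is unpickled only on the first call for a given path.

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=None)
def load_fd(path="data/fd.txt"):
    """Unpickle the distribution once per path; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)

def compute_ave_freq(word_forms, fd=None):
    """Average relative frequency of the given forms; 0 for an empty list."""
    # Fall back to the cached distribution when the caller does not pass one
    fd = load_fd() if fd is None else fd
    if not word_forms:
        return 0
    return sum(fd.freq(form) for form in word_forms) / len(word_forms)
```

Passing `fd` explicitly also makes the function easier to test, since any object with a `freq` method can stand in for the real distribution.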

Answers

1

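As the comments say, moving the fd variable outside the function should solve the problem:

import pickle

# Load the FreqDist once, when the module is imported
with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq / len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq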

But since you're writing an averaging function, here's a simpler implementation:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    try:
        return sum([fd.freq(form) for form in word_forms]) / len(word_forms)
    except ZeroDivisionError:
        return 0

Or:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    n = len(word_forms)
    if n > 0:
        return sum([fd.freq(form) for form in word_forms]) / n
    else:
        return 0

Or simply:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    n = len(word_forms)
    return sum([fd.freq(form) for form in word_forms]) / n if n > 0 else 0

Or with a lambda:

fd = pickle.load(open("data/fd.txt", 'rb')) 
compute_ave_freq = lambda x: sum(fd.freq(i) for i in x)/len(x) 
ave_freq = compute_ave_freq(word_forms) if len(word_forms) > 0 else 0 

Do take a look at EAFP and LBYL.
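For readers unfamiliar with the two terms, here is a small illustration on plain numbers: EAFP ("easier to ask forgiveness than permission") corresponds to the try/except variant, and LBYL ("look before you leap") to the variants that check the length first. The function names `mean_eafp` and `mean_lbyl` are made up for this sketch.

```python
def mean_eafp(values):
    """EAFP: attempt the division and handle the failure if it happens."""
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        return 0

def mean_lbyl(values):
    """LBYL: check the precondition first, then divide."""
    if len(values) > 0:
        return sum(values) / len(values)
    return 0
```

Both return the same results; the choice is mostly a matter of style and of how often the empty case is expected.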

+0

Thanks again, and thanks for the reference. I used the first variant you suggested (with the list comprehension), because lambdas give me a headache... :-) – larapsodia
