Faster way to store an NLTK FreqDist?

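I'm trying to speed up my application, and I've found that the simple little function below (compute_ave_freq) is actually one of the biggest time hogs. The culprit appears to be unpickling the NLTK FreqDist; it takes a substantial amount of time.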

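Granted, even that amount of time is less than half of what it takes to compute the FreqDist from scratch. Is there a better way to save an NLTK FreqDist object? I tried serializing it to JSON, but that saves it as a plain dictionary, which loses a lot of the NLTK functionality I need.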

The code is below:

import pickle

def compute_ave_freq(word_forms):
    fd = pickle.load(open("data/fd.txt", 'rb'))
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq

And here is the LineProfiler output:

Total time: 0.197121 s
File: /home/username/development/appname/filename.py
Function: compute_ave_freq at line 25

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def compute_ave_freq(word_forms, debug=False):
    26                                               # word_forms is a list of morphological variations of a word, such as
    27                                               # ['كتبوا', 'كتبو', 'كتبنا', 'كتبت']
    28
    29         1        78580  78580.0     79.1      fd = pickle.load(open("data/fd.txt", 'rb'))
    30         1            3      3.0      0.0      total_freq = 0
    31         5           10      2.0      0.0      for form in word_forms:
    32         4        20676   5169.0     20.8          freq = fd.freq(form)
    33         4            9      2.2      0.0          if debug==True:
    34                                                       print(form, '\n', freq)
    35         4            6      1.5      0.0          total_freq += freq
    36         1            1      1.0      0.0      try:
    37         1            3      3.0      0.0          ave_freq = total_freq/len(word_forms)
    38                                               except ZeroDivisionError:
    39                                                   ave_freq = 0
    40         1            1      1.0      0.0      return ave_freq

Thanks!

Source

2016-03-01 larapsodia

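Unpickling has to load the stored object into RAM. That is a fairly hard problem to get around, but once it is loaded everything is fine. Putting the data into a database (e.g. SQL/Mongo) would probably be a better approach for larger datasets. Otherwise, just wait a moment while it loads into RAM. – alvas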

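I think the general rule is: if you have a dataset that can be loaded entirely into RAM without much strain, then the time saved by indexing/querying a database is probably not much. – alvas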
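A minimal sketch of the database idea from the comments above, using the standard-library sqlite3 module. The table name 'freq', its two columns, and both helper functions are hypothetical names chosen for illustration; nothing here comes from NLTK. It only covers the relative-frequency lookup, not the rest of what a FreqDist offers.

```python
import sqlite3

def build_freq_db(counts, db_path=":memory:"):
    """Store word counts (a dict of word -> count) in an SQLite table.

    The schema is an assumption made for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS freq "
        "(word TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO freq VALUES (?, ?)", counts.items())
    conn.commit()
    return conn

def ave_freq_from_db(conn, word_forms):
    """Average relative frequency of word_forms, queried from the table."""
    if not word_forms:
        return 0
    # Total number of tokens counted; SUM() yields NULL on an empty table.
    total = conn.execute("SELECT SUM(count) FROM freq").fetchone()[0] or 0
    if total == 0:
        return 0
    placeholders = ",".join("?" * len(word_forms))
    rows = conn.execute(
        f"SELECT word, count FROM freq WHERE word IN ({placeholders})",
        list(word_forms),
    ).fetchall()
    found = dict(rows)
    # Words missing from the table contribute a frequency of 0.
    return sum(found.get(w, 0) / total for w in word_forms) / len(word_forms)
```

Only the rows for the requested words are read, so nothing has to be unpickled up front. The SUM() query still scans the whole table on each call, so for repeated lookups the total would be worth computing once and reusing.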

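Move the line fd = pickle.load(open("data/fd.txt", 'rb')) outside the function and pass 'fd' to the function whenever it changes, i.e. def compute_ave_freq(word_forms, fd):. Alternatively, if 'fd' never changes, just make 'fd' a global variable and load it once. – alvas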
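The parameter-passing variant suggested in this comment might look roughly like the sketch below. The load_fd helper and its use of functools.lru_cache are assumptions added here so that the file is unpickled at most once per path. SimpleFreqDist is a hypothetical stand-in that mimics only the freq() method of nltk.FreqDist, so the example runs without NLTK installed.

```python
import pickle
from collections import Counter
from functools import lru_cache

class SimpleFreqDist(Counter):
    """Hypothetical stand-in for nltk.FreqDist (only freq() is mimicked)."""
    def freq(self, sample):
        total = sum(self.values())
        return self[sample] / total if total else 0

@lru_cache(maxsize=None)
def load_fd(path):
    # Unpickle once per path; later calls return the same cached object.
    with open(path, "rb") as f:
        return pickle.load(f)

def compute_ave_freq(word_forms, fd):
    # fd is passed in, so the caller decides when (and how often) to load it.
    if not word_forms:
        return 0
    return sum(fd.freq(form) for form in word_forms) / len(word_forms)
```

A caller would then write something like compute_ave_freq(word_forms, load_fd("data/fd.txt")), paying the unpickling cost only on the first call.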

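As the comments say, moving the fd variable outside the function should solve the problem:

import pickle

# Load the pickled FreqDist once, at import time, instead of on every call.
with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq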

But since you are writing an averaging function, here is a simpler implementation:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    try:
        return sum([fd.freq(form) for form in word_forms])/len(word_forms)
    except ZeroDivisionError:
        return 0

Or:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    n = len(word_forms)
    if n > 0:
        return sum([fd.freq(form) for form in word_forms])/n
    else:
        return 0

Or simply:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    n = len(word_forms)
    return sum([fd.freq(form) for form in word_forms])/n if n > 0 else 0

Or with a lambda:

import pickle

with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)
compute_ave_freq = lambda x: sum(fd.freq(i) for i in x)/len(x)
ave_freq = compute_ave_freq(word_forms) if len(word_forms) > 0 else 0

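Do take a look at EAFP ("easier to ask forgiveness than permission") and LBYL ("look before you leap"): the try/except version above is EAFP, while the versions that check the length first are LBYL.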
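One further observation on the profile in the question: fd.freq(form) accounts for about 20% of the time, roughly 5 ms per call. A possible explanation, not verified against the NLTK version used there, is that freq() recomputes the total count N() on every call. Under that assumption, a sketch that computes the total once and then uses plain count lookups could look like the code below. SimpleFreqDist is a hypothetical stand-in for nltk.FreqDist, and the optional 'total' parameter is an assumption introduced for illustration.

```python
from collections import Counter

class SimpleFreqDist(Counter):
    """Hypothetical stand-in mirroring the parts of nltk.FreqDist used here."""
    def N(self):
        # Total number of counted outcomes.
        return sum(self.values())

    def freq(self, sample):
        n = self.N()
        return self[sample] / n if n else 0

def compute_ave_freq(word_forms, fd, total=None):
    """Average relative frequency, computing the total count only once."""
    if not word_forms:
        return 0
    # Reuse a precomputed total when the caller supplies one.
    n = fd.N() if total is None else total
    if n == 0:
        return 0
    # fd[form] is a plain dictionary lookup of the raw count.
    return sum(fd[form] for form in word_forms) / n / len(word_forms)
```

When the function is called many times against the same distribution, computing total = fd.N() once and passing it in avoids summing the counts on every call. Whether this helps in practice depends on how the installed NLTK implements freq(), so it is worth profiling again.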

Source

2016-03-02 03:00:18 alvas

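Thanks again, and thank you for the reference. I used the first variant you suggested (with the list comprehension), because lambdas give me a headache... :-) – larapsodia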
