2017-05-06 36 views
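Optimizing nested for-loops for execution time

I want to optimize my code for execution time. The code runs on the DataFrame alldata, which contains about 300,000 entries, but the computation takes a very long time (around 10 hours).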

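The logic of the computation is as follows:

For each missing (NaN) value in the columns listed in list_of_NA_features, the function fill_missing_values searches for the most similar row (cosine similarity is computed over the columns in list_of_non_nan_features, which are never null) and returns that row's value, which is then written to the current column and row of alldata.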

import pandas as pd
from scipy import spatial


def fill_missing_values(param_nan, current_row, df):
    # Candidate rows: those where the target column is not NaN.
    df_non_nan = df.dropna(subset=[param_nan])
    list_of_non_nan_features = ["f1", "f2", "f3", "f4", "f5"]
    max_val = 0
    searched_val = 0
    vector1 = current_row[list_of_non_nan_features].values
    for index, row in df_non_nan.iterrows():
        vector2 = row[list_of_non_nan_features].values
        # Cosine similarity = 1 - cosine distance.
        sim = 1 - spatial.distance.cosine(vector1, vector2)
        if sim > max_val:
            max_val = sim
            searched_val = row[param_nan]
    return searched_val


# df_train and alldata are existing DataFrames (not defined in this snippet).
list_of_NA_features = df_train.columns[df_train.isnull().any()]

for feature in list_of_NA_features:
    for index, row in alldata.iterrows():
        if pd.isnull(row[feature]):
            missing_value = fill_missing_values(feature, row, alldata)
            # .ix was removed in pandas 1.0; .loc is the label-based replacement.
            alldata.loc[index, feature] = missing_value

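Is it possible to optimize this code? For example, I am thinking of replacing the for-loops with lambda functions. Is that possible?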

How would turning your for-loops into `lambda` functions help? And why `lambda` functions rather than normal functions? –

@juanpa.arrivillaga That was my assumption, because I read that `apply(lambda x: ...)` is faster than a loop. – Dinosaurius

It *certainly is not*. `pandas.DataFrame.apply` is a Python for-loop under the hood. –

1 Answer

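Instead of replacing your for-loops with lambdas, try replacing them with ufuncs.

"Losing Your Loops: Fast Numerical Computation with Numpy" is a great talk on this subject by Jake VanderPlas. Using universal functions and broadcasting instead of for-loops can significantly improve the speed of your code.

Here is a basic example: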

import numpy as np
from time import time


def timed(func):
    # Decorator that prints how long each call to func takes.
    def inner(*args, **kwargs):
        t0 = time()
        result = func(*args, **kwargs)
        elapsed = time() - t0
        print(f'ran {func.__name__} in {elapsed} seconds')
        return result
    return inner


# without broadcasting:
@timed
def sums():
    sums = np.zeros([500, 500])
    for a in range(500):
        for b in range(500):
            sums[a, b] = a + b
    return sums


# with broadcasting:
@timed
def sums_broadcasted():
    a = np.arange(500)
    b = np.reshape(np.arange(500), [500, 1])
    return a + b

INPUT:

a = sums()
b = sums_broadcasted()
assert (a == b).all()

OUTPUT:

ran sums in 0.030008554458618164 seconds 
ran sums_broadcasted in 0.0005011558532714844 seconds 

Note that by eliminating the loops we get a roughly 60x speedup!
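Applied to the problem in the question, the same idea might look like the sketch below. It is not the asker's code: `fill_by_cosine_similarity`, its `chunk_size` parameter and the toy arrays are hypothetical names introduced here for illustration, and the sketch assumes the feature columns are numeric and free of NaNs. Rows are normalised to unit length once, so a single matrix product yields every cosine similarity in a chunk, and `argmax` replaces the inner `iterrows` loop.

```python
import numpy as np


def fill_by_cosine_similarity(features, values, chunk_size=128):
    """Fill NaNs in `values` from the most cosine-similar row (hypothetical helper).

    features: (n, k) array without NaNs (f1..f5 in the question)
    values:   (n,) array that may contain NaNs (one column to impute)
    """
    features = np.asarray(features, dtype=float)
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    missing = np.isnan(values)
    if not missing.any():
        return filled
    if missing.all():
        # No donor rows at all: the original loop would return 0 here.
        filled[:] = 0.0
        return filled

    # Unit-length rows: dot product of two rows == their cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)

    donors = unit[~missing]            # rows whose value is known
    donor_values = values[~missing]
    missing_idx = np.flatnonzero(missing)

    # Work in chunks: each step builds a (chunk_size, n_donors) matrix.
    for start in range(0, missing_idx.size, chunk_size):
        idx = missing_idx[start:start + chunk_size]
        sims = unit[idx] @ donors.T
        best = sims.argmax(axis=1)
        best_sim = sims[np.arange(idx.size), best]
        # Mirror the original loop: only a strictly positive similarity
        # counts as a match, otherwise the filled value is 0.
        filled[idx] = np.where(best_sim > 0, donor_values[best], 0.0)
    return filled


# Toy data: rows 2 and 3 point in nearly the same direction as rows 0 and 1.
toy_features = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.1], [0.1, 3.0]])
toy_values = np.array([10.0, 20.0, np.nan, np.nan])
print(fill_by_cosine_similarity(toy_features, toy_values))
```

With the question's DataFrame, `features` would presumably be `alldata[list_of_non_nan_features].to_numpy()` and `values` one NaN-containing column at a time. One difference from the original loop: donors are taken from the column as it was before filling, whereas the loop can reuse values it filled earlier in the same pass.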