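Some optimization is possible for data with 3 columns: sort each row, then simply pick the first or the second column depending on the NaNs, since sorting pushes the NaNs to the end of each row. This lets us use slicing to make the selection and get the required median_low value for each row.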
Here are those steps put together as a vectorized solution -
a = df.values
a_sorted = np.sort(a,1)
df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
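As a small illustration of why the selection works, here is a minimal sketch on a made-up 3x3 array (the sample values are an assumption for the example, not data from the question). np.sort places NaNs at the end of each row, so the third sorted column is NaN exactly when the row has a missing value.

```python
import numpy as np

# Hypothetical sample rows (not from the original post); each has at most one NaN.
a = np.array([[3.0, np.nan, 1.0],
              [5.0, 2.0, 4.0],
              [np.nan, 7.0, 7.0]])

# Row-wise sort; NaNs are pushed to the end of each row.
a_sorted = np.sort(a, axis=1)

# If the last sorted value is NaN, at most two valid values remain, so the
# low median is the smallest one; otherwise it is the middle of the three.
median_low = np.where(np.isnan(a_sorted[:, 2]), a_sorted[:, 0], a_sorted[:, 1])
print(median_low)
```

The rows reduce to 1, 4 and 7 respectively: the first and third rows have only two valid values, so the smaller one is taken, while the second row has all three and the middle one is taken.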
Runtime test
Approaches -
# Proposed in this post
def vectorized_app(df):
    a = df.values
    a_sorted = np.sort(a,1)
    df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
    return df

# @piRSquared's new soln
def vectorized_app2(df):
    v = np.sort(df.values, axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    j = (n - 1) // 2
    i = np.arange(len(v))
    return df.assign(median_low=v[i, j])

# @piRSquared's old soln
from statistics import median_low
def apply_app(df):
    med = lambda x: median_low(x.dropna())
    return df.apply(med, 1)
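Before comparing timings, it can help to confirm that the approaches agree. The following is only a sketch under stated assumptions: it works on a plain NumPy array instead of a DataFrame, uses made-up random data, and the names m1, m2 and ref are hypothetical. It checks both vectorized selections against statistics.median_low applied row by row.

```python
import numpy as np
from statistics import median_low

# Made-up test data (an assumption for this check): 1000 rows, 3 columns,
# with one NaN placed in roughly half of the rows.
rng = np.random.RandomState(0)
a = rng.randint(0, 9, (1000, 3)).astype(float)
rows = np.flatnonzero(rng.rand(len(a)) < 0.5)
a[rows, rng.randint(0, 3, len(rows))] = np.nan

# Row-wise sort; NaNs end up last in each row.
v = np.sort(a, axis=1)

# Selection based on the last sorted column (the np.where idea).
m1 = np.where(np.isnan(v[:, 2]), v[:, 0], v[:, 1])

# Selection based on the count of valid values per row (the indexing idea).
n = np.count_nonzero(~np.isnan(v), axis=1)
m2 = v[np.arange(len(v)), (n - 1) // 2]

# Reference: statistics.median_low on the non-NaN values of each row.
ref = np.array([median_low(row[~np.isnan(row)].tolist()) for row in a])

print(np.array_equal(m1, ref), np.array_equal(m2, ref))
```

Note that the np.where variant relies on there being exactly 3 columns, whereas the count-based indexing generalizes to any number of columns.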
Timings -
In [433]: # Setup input dataframe and set one per row as NaN
     ...: np.random.seed(0)
     ...: a = np.random.randint(0,9,(10000,3)).astype(float)
     ...: idx = np.random.randint(0,3,a.shape[0])
     ...: a[np.arange(a.shape[0]), idx] = np.nan
     ...: df = pd.DataFrame(a)
     ...: df.columns = [['val1','val2','val3']]
     ...:
In [435]: %timeit vectorized_app(df)
1000 loops, best of 3: 481 µs per loop
In [436]: %timeit vectorized_app2(df)
1000 loops, best of 3: 892 µs per loop
In [434]: %timeit apply_app(df)
1 loop, best of 3: 1.15 s per loop
If it's three columns (an odd number), why do you need to worry about the low median? – Divakar
Sometimes there are NULL values –
By NULL, you mean NaNs, right? – Divakar