2016-11-29

Improve pandas merge performance

I don't have a pandas merge performance problem per se, as other posts have discussed; rather, I have a class with many methods that performs a lot of merges on datasets.

The class does around 10 to 15 groupbys and merges. While the groupbys are fairly fast, of the total execution time of 1.5 seconds, roughly 0.7 seconds is spent in those 15 merge calls.

I want to speed up these merge calls. Since I will run about 4,000 iterations, shaving 0.5 seconds off a single iteration would cut the overall runtime by roughly 30 minutes (0.5 s × 4,000 ≈ 33 min), which would be great.

Any suggestions on what I should try? I have tried Cython and Numba; Numba was slower.

Thanks.

Edit 1: Adding a sample code snippet. My merge statements:

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left') 
tmp = tmpDf 

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left') 
tmp = tmpDf 

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left') 
tmp = tmpDf 

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left') 
tmp = tmpDf 

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left') 
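
For readability, the five chained merges above can also be folded into a single expression with functools.reduce. This is equivalent, not faster; it is a sketch, and the small frames below are stand-ins for the question's self.data and t1..t5:

```python
from functools import reduce

import pandas as pd

# Stand-in frames; in the question these would be self.data and t1..t5,
# all sharing the key column 'APPT_NBR'.
data = pd.DataFrame({'APPT_NBR': [1, 2, 3], 'base': ['a', 'b', 'c']})
t1 = pd.DataFrame({'APPT_NBR': [1, 2], 'v1': [10, 20]})
t2 = pd.DataFrame({'APPT_NBR': [2, 3], 'v2': [200, 300]})

# Left-merge each table onto the accumulated result in turn.
tmpDf = reduce(
    lambda left, right: pd.merge(left, right, on='APPT_NBR', how='left'),
    [t1, t2],
    data,
)
```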

And, implementing this with joins instead, I use the following statements:

dat = self.data.set_index('APPT_NBR') 

t1.set_index('APPT_NBR', inplace=True) 
t2.set_index('APPT_NBR', inplace=True) 
t3.set_index('APPT_NBR', inplace=True) 
t4.set_index('APPT_NBR', inplace=True) 
t5.set_index('APPT_NBR', inplace=True) 

tmpDf = dat.join(t1, how='left') 
tmpDf = tmpDf.join(t2, how='left') 
tmpDf = tmpDf.join(t3, how='left') 
tmpDf = tmpDf.join(t4, how='left') 
tmpDf = tmpDf.join(t5, how='left') 

tmpDf.reset_index(inplace=True) 
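
Instead of five sequential join calls, DataFrame.join also accepts a list of DataFrames, which lets pandas align everything against the shared index in one pass. A sketch with stand-in frames; the question's dat and t1..t5 would slot in the same way:

```python
import pandas as pd

# Stand-ins for the question's frames, already indexed on the key.
dat = pd.DataFrame({'base': ['a', 'b', 'c']},
                   index=pd.Index([1, 2, 3], name='APPT_NBR'))
t1 = pd.DataFrame({'v1': [10, 20]}, index=pd.Index([1, 2], name='APPT_NBR'))
t2 = pd.DataFrame({'v2': [200, 300]}, index=pd.Index([2, 3], name='APPT_NBR'))

# One call aligns all frames on the shared index.
tmpDf = dat.join([t1, t2], how='left')
tmpDf.reset_index(inplace=True)
```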

Note that all of these are part of the function: def merge_earlier_created_values(self):

And when I time the call with timedcall from profilehooks:

@timedcall(immediate=True) 
def merge_earlier_created_values(self): 

I get the following results:

The method was profiled with:

@profile(immediate=True) 
def merge_earlier_created_values(self): 

Profiling the merge version gives the following:

*** PROFILER RESULTS *** 
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon  Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122) 
function called 1 times 

    71665 function calls (70588 primitive calls) in 0.524 seconds 

Ordered by: cumulative time, internal time, call count 
List reduced from 563 to 40 due to restriction <40> 

ncalls tottime percall cumtime percall filename:lineno(function) 
    1 0.012 0.012 0.524 0.524 get_prev_data_by_date.py:122(merge_earlier_created_values) 
    14 0.000 0.000 0.285 0.020 generic.py:1901(_update_inplace) 
    14 0.000 0.000 0.285 0.020 generic.py:1402(_maybe_update_cacher) 
    19 0.000 0.000 0.284 0.015 generic.py:1492(_check_setitem_copy) 
    7 0.283 0.040 0.283 0.040 {built-in method gc.collect} 
    15 0.000 0.000 0.181 0.012 generic.py:1842(drop) 
    10 0.000 0.000 0.153 0.015 merge.py:26(merge) 
    10 0.000 0.000 0.140 0.014 merge.py:201(get_result) 
    8/4 0.000 0.000 0.126 0.031 decorators.py:65(wrapper) 
    4 0.000 0.000 0.126 0.031 frame.py:3028(drop_duplicates) 
    1 0.000 0.000 0.102 0.102 get_prev_data_by_date.py:264(recreate_previous_cartons) 
    1 0.000 0.000 0.101 0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date) 
    1 0.000 0.000 0.098 0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type) 
    10 0.000 0.000 0.092 0.009 internals.py:4455(concatenate_block_managers) 
    10 0.001 0.000 0.088 0.009 internals.py:4471(<listcomp>) 
    120 0.001 0.000 0.084 0.001 internals.py:4559(concatenate_join_units) 
    266 0.004 0.000 0.067 0.000 common.py:733(take_nd) 
    120 0.000 0.000 0.061 0.001 internals.py:4569(<listcomp>) 
    120 0.003 0.000 0.061 0.001 internals.py:4814(get_reindexed_values) 
    1 0.000 0.000 0.059 0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status) 
    10 0.000 0.000 0.038 0.004 merge.py:322(_get_join_info) 
    10 0.001 0.000 0.036 0.004 merge.py:516(_get_join_indexers) 
    25 0.001 0.000 0.024 0.001 merge.py:687(_factorize_keys) 
    74 0.023 0.000 0.023 0.000 {pandas.algos.take_2d_axis1_object_object} 
    50 0.022 0.000 0.022 0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects} 
    120 0.003 0.000 0.022 0.000 internals.py:4479(get_empty_dtype_and_na) 
    88 0.000 0.000 0.021 0.000 frame.py:1969(__getitem__) 
    1 0.000 0.000 0.019 0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers) 
    39 0.000 0.000 0.018 0.000 internals.py:3495(reindex_indexer) 
    537 0.017 0.000 0.017 0.000 {built-in method numpy.core.multiarray.empty} 
    15 0.000 0.000 0.017 0.001 ops.py:725(wrapper) 
    15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array) 
    24 0.000 0.000 0.014 0.001 internals.py:3625(take) 
    10 0.000 0.000 0.014 0.001 merge.py:157(__init__) 
    10 0.000 0.000 0.014 0.001 merge.py:382(_get_merge_keys) 
    15 0.008 0.001 0.013 0.001 ops.py:662(na_op) 
    234 0.000 0.000 0.013 0.000 common.py:158(isnull) 
    234 0.001 0.000 0.013 0.000 common.py:179(_isnull_new) 
    15 0.000 0.000 0.012 0.001 generic.py:1609(take) 
    20 0.000 0.000 0.012 0.001 generic.py:2191(reindex) 

Profiling the join version gives the following:

65079 function calls (63990 primitive calls) in 0.550 seconds 

Ordered by: cumulative time, internal time, call count 
List reduced from 592 to 40 due to restriction <40> 

ncalls tottime percall cumtime percall filename:lineno(function) 
    1 0.016 0.016 0.550 0.550 get_prev_data_by_date.py:122(merge_earlier_created_values) 
    14 0.000 0.000 0.295 0.021 generic.py:1901(_update_inplace) 
    14 0.000 0.000 0.295 0.021 generic.py:1402(_maybe_update_cacher) 
    19 0.000 0.000 0.294 0.015 generic.py:1492(_check_setitem_copy) 
    7 0.293 0.042 0.293 0.042 {built-in method gc.collect} 
    10 0.000 0.000 0.173 0.017 generic.py:1842(drop) 
    10 0.000 0.000 0.139 0.014 merge.py:26(merge) 
    8/4 0.000 0.000 0.138 0.034 decorators.py:65(wrapper) 
    4 0.000 0.000 0.138 0.034 frame.py:3028(drop_duplicates) 
    10 0.000 0.000 0.132 0.013 merge.py:201(get_result) 
    5 0.000 0.000 0.122 0.024 frame.py:4324(join) 
    5 0.000 0.000 0.122 0.024 frame.py:4371(_join_compat) 
    1 0.000 0.000 0.111 0.111 get_prev_data_by_date.py:264(recreate_previous_cartons) 
    1 0.000 0.000 0.103 0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date) 
    1 0.000 0.000 0.099 0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type) 
    10 0.000 0.000 0.093 0.009 internals.py:4455(concatenate_block_managers) 
    10 0.001 0.000 0.089 0.009 internals.py:4471(<listcomp>) 
    100 0.001 0.000 0.085 0.001 internals.py:4559(concatenate_join_units) 
    205 0.003 0.000 0.068 0.000 common.py:733(take_nd) 
    100 0.000 0.000 0.060 0.001 internals.py:4569(<listcomp>) 
    100 0.001 0.000 0.060 0.001 internals.py:4814(get_reindexed_values) 
    1 0.000 0.000 0.056 0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status) 
    10 0.000 0.000 0.033 0.003 merge.py:322(_get_join_info) 
    52 0.031 0.001 0.031 0.001 {pandas.algos.take_2d_axis1_object_object} 
    5 0.000 0.000 0.030 0.006 base.py:2329(join) 
    37 0.001 0.000 0.027 0.001 internals.py:2754(apply) 
    6 0.000 0.000 0.024 0.004 frame.py:2763(set_index) 
    7 0.000 0.000 0.023 0.003 merge.py:516(_get_join_indexers) 
    2 0.000 0.000 0.022 0.011 base.py:2483(_join_non_unique) 
    7 0.000 0.000 0.021 0.003 generic.py:2950(copy) 
    7 0.000 0.000 0.021 0.003 internals.py:3046(copy) 
    84 0.000 0.000 0.020 0.000 frame.py:1969(__getitem__) 
    19 0.001 0.000 0.019 0.001 merge.py:687(_factorize_keys) 
    100 0.002 0.000 0.019 0.000 internals.py:4479(get_empty_dtype_and_na) 
    1 0.000 0.000 0.018 0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers) 
    15 0.000 0.000 0.017 0.001 ops.py:725(wrapper) 
    34 0.001 0.000 0.017 0.000 internals.py:3495(reindex_indexer) 
    83 0.004 0.000 0.016 0.000 internals.py:3211(_consolidate_inplace) 
    68 0.015 0.000 0.015 0.000 {method 'copy' of 'numpy.ndarray' objects} 
    15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array) 

As you can see, merge is faster than join here. The difference looks small, but over 4,000 iterations those small values add up to minutes.

Thanks.

Set the merge columns as the index, and use 'df1.join(df2)' instead. –

Answers


I suggest you set your merge columns as the index and use df1.join(df2) instead of merge; it is much faster.

Here are some examples, including timings:

In [1]: 
import pandas as pd 
import numpy as np 
df1 = pd.DataFrame(np.arange(1000000), columns=['A']) 
df1['B'] = np.random.randint(0,1000,(1000000)) 
df2 = pd.DataFrame(np.arange(1000000), columns=['A2']) 
df2['B2'] = np.random.randint(0,1000,(1000000)) 

Here is a regular left merge on A and A2:

In [2]: %%timeit 
     x = df1.merge(df2, how='left', left_on='A', right_on='A2') 

1 loop, best of 3: 441 ms per loop 

And here is the same thing using join:

In [3]: %%timeit 
     x = df1.set_index('A').join(df2.set_index('A2'), how='left') 

1 loop, best of 3: 184 ms per loop 

Now obviously, if you can set the index once before the loop rather than on every call, the gain in time is much greater. Inside the loop you then get the following, which in this case is about 30 times faster:

In [5]: %%timeit 
     x = df1.join(df2, how='left') 
100 loops, best of 3: 14.3 ms per loop 
This is a left merge/join. The merge uses the parameter how='left'; will that work with join as well? –

Does this solve your problem? –

Somehow I am not seeing a big performance improvement on my dataset. If I convert all the merges to joins, the time actually increases by about 0.1-0.3 seconds. Converting only some of the merges to joins cuts the time by about 0.2 seconds. Is there anything I am missing, or do I need to set things up exactly as in your code? –


Setting the index on the merge columns with set_index does indeed speed this up. Below is a slightly more realistic version of @julien-marrec's answer:

import pandas as pd 
import numpy as np 
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False) 
df1 = pd.DataFrame(myids, columns=['A']) 
df1['B'] = np.random.randint(0,1000,(1000000)) 
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2']) 
df2['B2'] = np.random.randint(0,1000,(1000000)) 

%%timeit 
    x = df1.merge(df2, how='left', left_on='A', right_on='A2') 
#1 loop, best of 3: 664 ms per loop 

%%timeit 
    x = df1.set_index('A').join(df2.set_index('A2'), how='left') 
#1 loop, best of 3: 354 ms per loop 

%%time 
    df1.set_index('A', inplace=True) 
    df2.set_index('A2', inplace=True) 
#Wall time: 16 ms 

%%timeit 
    x = df1.join(df2, how='left') 
#10 loops, best of 3: 80.4 ms per loop 

When the columns being joined contain integers that are not in the same order in the two tables, you can still expect a large speedup, about 8x here.
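
A further tweak worth trying (my addition, not part of the answers above): if the same indexed frames are joined repeatedly, sorting their indexes once up front lets every subsequent join take pandas' cheaper path for monotonic indexes. A minimal sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = rng.permutation(1000)  # unique, shuffled integer keys

df1 = pd.DataFrame({'B': rng.integers(0, 10, 1000)}, index=ids)
df2 = pd.DataFrame({'B2': rng.integers(0, 10, 1000)},
                   index=rng.permutation(ids))

# Sort once up front; joins on monotonic (sorted) indexes are cheaper.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)

x = df1.join(df2, how='left')
```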