2017-04-11 76 views
0

Pandas: apply a custom function on a melted DataFrame

I have a melted DataFrame that looks like this:

         date     group   metric   n_events  total_users
0  2017-01-01   control  metric1  33.919910   827.416818
27 2017-01-01  variant1  metric1  55.141467   780.840083
54 2017-01-01  variant2  metric1  63.045587   436.381533
1  2017-01-02   control  metric2  74.013340   145.551779
28 2017-01-02  variant1  metric2  78.539663   553.410827

I want to compute some uplift metrics on the melted DataFrame. So far I have been pivoting the DataFrame, which is not ideal.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'group': sorted(['control','variant1','variant2']*27),
     'metric': ['metric1', 'metric2', 'metric3']*27,
     'n_events': np.random.uniform(20, 100, size=81),
     'total_users': np.random.uniform(20, 1000, size=81),
     'date': list(pd.date_range('1/1/2017', periods=27, freq='D'))*3
    })

df = df.sort_values(['date','group','metric'])

t = pd.pivot_table(df, values=['n_events','total_users'],
                   index=['date','metric'],
                   columns=['group'],
                   aggfunc='sum').reset_index()

for var in ['variant1','variant2']:
    uplift_colname = var + "_standard_uplift"

    # adding daily uplift
    t[uplift_colname] = (t['n_events'][var]/t['total_users'][var]) - \
                        (t['n_events']['control']/t['total_users']['control'])

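I am looking for a better way to get the uplift without pivoting the DataFrame, so that the data stays in the melted format. I tried using groupby, or apply together with a custom function, i.e.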

df.groupby(['date','metric'])[['n_events','group','total_users']].apply(myfxn)
+0

Could you provide an example of the expected result? – greole

Answer

2
def proc(df):
    # total events and users per group within one (date, metric) slice
    s = df.groupby('group').sum()
    # rate per group
    r = s.n_events/s.total_users
    # uplift of each variant relative to control
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'
df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix)

This gives you the same information as your current t:

group    variant1_standard_uplift variant2_standard_uplift 
date  metric              
2017-01-01 metric1     -0.175006     -0.334146 
2017-01-02 metric2     0.213414     0.007030 
2017-01-03 metric3     0.041405     0.913016 
2017-01-04 metric1     -0.102361     -0.044124 
2017-01-05 metric2     0.114260     0.031469 
2017-01-06 metric3     0.316760     -0.113277 
2017-01-07 metric1     3.049462     0.052456 
2017-01-08 metric2     -0.050300     -0.015628 
2017-01-09 metric3     0.004769     0.239641 
2017-01-10 metric1     0.025574     0.153893 
2017-01-11 metric2     0.111758     0.083404 
2017-01-12 metric3     -0.175687     -0.107851 
2017-01-13 metric1     0.147153     0.266303 
2017-01-14 metric2     -0.162214     -0.238798 
2017-01-15 metric3     0.137627     0.010475 
2017-01-16 metric1     -0.223583     -0.208177 
2017-01-17 metric2     0.154821     0.189663 
2017-01-18 metric3     -0.161725     -0.536955 
2017-01-19 metric1     -0.002525     0.027977 
2017-01-20 metric2     -0.210697     0.564725 
2017-01-21 metric3     -0.228038     -0.255461 
2017-01-22 metric1     -0.210647     -0.141039 
2017-01-23 metric2     0.354086     -0.366433 
2017-01-24 metric3     0.344310     -0.045895 
2017-01-25 metric1     0.340080     0.105040 
2017-01-26 metric2     2.512369     -0.062200 
2017-01-27 metric3     -1.326842     -1.819911 
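As a quick sanity check, the sketch below compares the groupby/apply result with the pivot-based calculation from the question on the same data. The fixed seed, the use of `np.random.default_rng`, and the variable names `melted`, `pivoted` and `same` are assumptions made for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the check is reproducible (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'group': sorted(['control', 'variant1', 'variant2'] * 27),
    'metric': ['metric1', 'metric2', 'metric3'] * 27,
    'n_events': rng.uniform(20, 100, size=81),
    'total_users': rng.uniform(20, 1000, size=81),
    'date': list(pd.date_range('1/1/2017', periods=27, freq='D')) * 3,
})

def proc(g):
    s = g.groupby('group').sum()
    r = s.n_events / s.total_users
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']

# Uplift computed directly from the melted frame.
melted = df.groupby(gcols)[ocols].apply(proc)

# Uplift computed through a pivot, as in the question.
t = pd.pivot_table(df, values=['n_events', 'total_users'],
                   index=gcols, columns='group', aggfunc='sum')
rate = t['n_events'] / t['total_users']
pivoted = rate.drop(columns='control').sub(rate['control'], axis=0)

# Both frames are indexed by (date, metric) with variant1/variant2 columns.
same = np.allclose(melted.to_numpy(), pivoted.to_numpy())
print(same)
```

If the two routes agree, `same` is `True`, which suggests the pivot step can be dropped without changing the numbers.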

To keep the same DataFrame df but with the two new columns appended...

def proc(df): 
    s = df.groupby('group').sum() 
    r = s.n_events/s.total_users 
    return r.drop('control').sub(r.loc['control']) 

gcols = ['date', 'metric'] 
ocols = ['group', 'n_events', 'total_users'] 
suffix = '_standard_uplift' 
df.join(df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix), on=gcols).sort_index() 


+0

Thanks, what does `r.drop('control').sub(r.loc['control'])` do? – TinaW

+0

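`r` will be a `pd.Series` with 3 index entries, `['control', 'variant1', 'variant2']`. `r.drop('control')` removes the entry associated with the index 'control', leaving the other two, and then the value associated with 'control' is subtracted from them via `.sub(r.loc['control'])` – piRSquared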
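A minimal sketch of that step on a toy Series may make it concrete. The rate values below are made up for illustration and do not come from the question's data.

```python
import pandas as pd

# Hypothetical per-group rates for a single (date, metric) slice.
r = pd.Series({'control': 0.10, 'variant1': 0.15, 'variant2': 0.08})

# Drop the control entry, then subtract the control rate from what remains.
uplift = r.drop('control').sub(r.loc['control'])
print(uplift)
```

The result keeps only `variant1` and `variant2`, each holding its rate minus the control rate (roughly 0.05 and -0.02 here, up to floating-point rounding).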