2016-08-16 50 views
1

How do I correctly return a formatted pandas DataFrame from apply?

Suppose we have the following DataFrame:

import pandas as pd 
import numpy as np 

years = [2005, 2006] 
location = ['city', 'suburb'] 
dft = pd.DataFrame({ 
    'year': [years[np.random.randint(0, 1+1)] for _ in range(100)], 
    'location': [location[np.random.randint(0, 1+1)] for _ in range(100)], 
    'days_to_complete': np.random.randint(100, high=600, size=100), 
    'cost_in_millions': np.random.randint(1, high=10, size=100) 
}) 

Group by year and location, then apply a function such as the following:

def get_custom_summary(group): 
    gt_200 = group.days_to_complete > 200 
    lt_200 = group.days_to_complete < 200 

    avg_days_gt200 = group[gt_200].days_to_complete.mean() 
    avg_cost_gt200 = group[gt_200].cost_in_millions.mean() 

    avg_days_lt200 = group[lt_200].days_to_complete.mean() 
    avg_cost_lt200 = group[lt_200].cost_in_millions.mean() 

    lt_200_prop = lt_200.sum()/(gt_200.sum() + lt_200.sum()) 

    return pd.DataFrame({
        'gt_200': {'AVG_DAYS': avg_days_gt200, 'AVG_COST': avg_cost_gt200},
        'lt_200': {'avg_days': avg_days_lt200, 'avg_cost': avg_cost_lt200},
        'lt_200_prop': lt_200_prop
    })

result = dft.groupby(['year', 'location']).apply(get_custom_summary) 

Calling unstack(2) on the result gives the following output:

print(result.unstack(2)) 

                 gt_200                                    lt_200                                lt_200_prop
               AVG_COST    AVG_DAYS  avg_cost  avg_days  AVG_COST  AVG_DAYS  avg_cost  avg_days     AVG_COST  AVG_DAYS  avg_cost  avg_days
year location
2005 city      4.818182  415.636364       NaN       NaN       NaN       NaN  7.250000    165.50     0.153846  0.153846  0.153846  0.153846
     suburb    5.631579  336.631579       NaN       NaN       NaN       NaN  5.166667    140.50     0.240000  0.240000  0.240000  0.240000
2006 city      4.130435  396.913043       NaN       NaN       NaN       NaN  5.750000    150.75     0.258065  0.258065  0.258065  0.258065
     suburb    5.294118  392.823529       NaN       NaN       NaN       NaN  1.000000    128.00     0.055556  0.055556  0.055556  0.055556

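For the gt_200 and lt_200 columns, a dropna(axis=1) call would remove the columns filled with NaN, but the lt_200_prop column is still stuck with the wrong column names. How can I return a DataFrame from get_custom_summary so that the columns (gt_200, lt_200, lt_200_prop) are not broadcast (if that is the right word) across the sub-columns (AVG_COST, AVG_DAYS, avg_cost, avg_days)?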
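
To see where the repeated lt_200_prop values come from, here is a minimal sketch with made-up numbers. It assumes that when the DataFrame constructor receives a dict mixing nested dicts and a scalar, the row index is built from all of the inner keys and the scalar is repeated on every row, which is consistent with the output shown above.

```python
import pandas as pd

# Made-up numbers; only the shape of the result matters here.
frame = pd.DataFrame({
    'gt_200': {'AVG_DAYS': 400.0, 'AVG_COST': 5.0},
    'lt_200': {'avg_days': 150.0, 'avg_cost': 4.0},
    'lt_200_prop': 0.25,  # scalar: repeated on every row
})

# One row per inner key; NaN wherever a column has no value for that key.
print(frame)
```

Each group therefore already carries four copies of lt_200_prop before unstack(2) moves the inner keys into the columns.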

Edit:

Desired output:

                 gt_200                lt_200            lt_200_prop
               AVG_COST    AVG_DAYS  avg_cost  avg_days
year location
2005 city      4.818182  415.636364  7.250000    165.50     0.153846
     suburb    5.631579  336.631579  5.166667    140.50     0.240000
2006 city      4.130435  396.913043  5.750000    150.75     0.258065
     suburb    5.294118  392.823529  1.000000    128.00     0.055556
+0

Can you add the desired output? – jezrael

+0

@jezrael I just added the desired output. – Jay

Answers

0

Return a DataFrame whose columns are set to a MultiIndex.

from collections import OrderedDict 

def get_multi_index(ordered_dict): 
    length = len(list(ordered_dict.values())[0]) 

    for k in ordered_dict:
        assert len(ordered_dict[k]) == length

    names = list()
    arrays = list()
    for k in ordered_dict:
        names.append(k)
        arrays.append(np.array(ordered_dict[k]))

    tuples = list(zip(*arrays)) 
    return pd.MultiIndex.from_tuples(tuples, names=names) 

def get_custom_summary(group): 
    gt_200 = group.days_to_complete > 200 
    lt_200 = group.days_to_complete < 200 

    avg_days_gt_200 = group[gt_200].days_to_complete.mean() 
    avg_cost_gt_200 = group[gt_200].cost_in_millions.mean() 

    avg_days_lt_200 = group[lt_200].days_to_complete.mean() 
    avg_cost_lt_200 = group[lt_200].cost_in_millions.mean() 

    lt_200_prop = lt_200.sum()/(gt_200.sum() + lt_200.sum()) 

    ordered_dict = OrderedDict() 
    ordered_dict['first'] = ['lt_200', 'lt_200', 'gt_200', 'gt_200', 'lt_200_prop'] 
    ordered_dict['second'] = ['avg_cost', 'avg_days', 'AVG_COST', 'AVG_DAYS', 'prop'] 

    data = [[avg_cost_lt_200, avg_days_lt_200, avg_cost_gt_200, avg_days_gt_200, lt_200_prop]] 
    return pd.DataFrame(data, columns=get_multi_index(ordered_dict)) 

Get and print the result:

result = dft.groupby(['year', 'location']).apply(get_custom_summary).xs(0, level=2) 
print(result) 

Output:

first            lt_200                gt_200              lt_200_prop
second         avg_cost    avg_days  AVG_COST    AVG_DAYS         prop
year location
2005 city      7.555556  135.444444  5.300000  363.750000     0.310345
     suburb    5.000000  137.333333  5.555556  444.222222     0.250000
2006 city      6.250000  169.000000  4.714286  422.380952     0.160000
     suburb    4.428571  133.142857  4.333333  445.666667     0.318182
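
A possible variation, sketched here under assumptions rather than taken from the answer: if the applied function returns a Series whose index is a MultiIndex, groupby(...).apply should place those labels in the columns directly, so the extra xs(0, level=2) step is not needed. The small data set, the COLUMNS constant and the name summary_as_series are hypothetical.

```python
import pandas as pd

# Small hand-made data (hypothetical values) so the example is deterministic.
dft = pd.DataFrame({
    'year': [2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006],
    'location': ['city', 'city', 'suburb', 'suburb',
                 'city', 'city', 'suburb', 'suburb'],
    'days_to_complete': [150, 400, 120, 300, 180, 500, 100, 250],
    'cost_in_millions': [3, 7, 2, 6, 4, 8, 1, 5],
})

# Two-level column labels, defined once and reused for every group.
COLUMNS = pd.MultiIndex.from_tuples([
    ('gt_200', 'AVG_COST'), ('gt_200', 'AVG_DAYS'),
    ('lt_200', 'avg_cost'), ('lt_200', 'avg_days'),
    ('lt_200_prop', 'prop'),
])

def summary_as_series(group):
    gt_200 = group.days_to_complete > 200
    lt_200 = group.days_to_complete < 200
    values = [
        group.loc[gt_200, 'cost_in_millions'].mean(),
        group.loc[gt_200, 'days_to_complete'].mean(),
        group.loc[lt_200, 'cost_in_millions'].mean(),
        group.loc[lt_200, 'days_to_complete'].mean(),
        lt_200.sum() / (gt_200.sum() + lt_200.sum()),
    ]
    # One Series per group; its MultiIndex becomes the result's columns.
    return pd.Series(values, index=COLUMNS)

# Selecting the value columns first keeps the grouping keys out of each group.
result = (
    dft.groupby(['year', 'location'])[['days_to_complete', 'cost_in_millions']]
       .apply(summary_as_series)
)
print(result)
```

Because every group returns a Series with the same index, the result has one row per (year, location) pair and no extra index level to remove.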
1

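My solution is to use the same column names for gt_200 and lt_200 inside the function get_custom_summary, then rename them with str.lower and add a custom name, col, for the last column.

The columns are a MultiIndex, though, so you need to create a new one with MultiIndex.from_tuples: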

years = [2005, 2006] 
location = ['city', 'suburb'] 
np.random.seed(1234) 
dft = pd.DataFrame({ 
    'year': [years[np.random.randint(0, 1+1)] for _ in range(100)], 
    'location': [location[np.random.randint(0, 1+1)] for _ in range(100)], 
    'days_to_complete': np.random.randint(100, high=600, size=100), 
    'cost_in_millions': np.random.randint(1, high=10, size=100) 
}) 

def get_custom_summary(group): 
    gt_200 = group.days_to_complete > 200 
    lt_200 = group.days_to_complete < 200 

    avg_days_gt200 = group[gt_200].days_to_complete.mean() 
    avg_cost_gt200 = group[gt_200].cost_in_millions.mean() 

    avg_days_lt200 = group[lt_200].days_to_complete.mean() 
    avg_cost_lt200 = group[lt_200].cost_in_millions.mean() 

    lt_200_prop = (lt_200).sum()/((gt_200).sum() + (lt_200).sum()) 

    return pd.DataFrame({
        'gt_200': {'AVG_DAYS': avg_days_gt200, 'AVG_COST': avg_cost_gt200},
        'lt_200': {'AVG_DAYS': avg_days_lt200, 'AVG_COST': avg_cost_lt200},
        'lt_200_prop': lt_200_prop
    })

result = dft.groupby(['year', 'location']).apply(get_custom_summary).unstack(2)
# drop the last column, which only duplicates the lt_200_prop values
result = result.drop(result.columns[[-1]], axis=1)

# rename the level-1 column labels
a = result.columns.get_level_values(1)
# Index.union returns sorted labels: AVG_COST, AVG_DAYS, avg_cost, avg_days, col
level1 = a[:2].union(a[2:4].str.lower().union(['col']))
cols = list(zip(result.columns.get_level_values(0), level1))
result.columns = pd.MultiIndex.from_tuples(cols)

print(result)
                 gt_200                lt_200              lt_200_prop
               AVG_COST    AVG_DAYS  avg_cost    avg_days          col
year location
2005 city      5.238095  392.095238  5.500000  144.666667     0.222222
     suburb    4.428571  427.095238  4.000000  167.666667     0.125000
2006 city      4.368421  406.789474  4.571429  150.142857     0.269231
     suburb    4.000000  439.062500  4.142857  145.142857     0.304348
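
The union call above depends on the labels happening to sort into the intended order. A more explicit sketch of the same renaming maps each (top, sub) pair individually. The helper relabel, the frame name summary and its values are made up for illustration.

```python
import pandas as pd

# Hand-built frame (invented values) with the column layout that remains
# after the duplicate lt_200_prop column has been dropped.
summary = pd.DataFrame(
    [[5.0, 400.0, 4.0, 150.0, 0.25]],
    columns=pd.MultiIndex.from_tuples([
        ('gt_200', 'AVG_COST'), ('gt_200', 'AVG_DAYS'),
        ('lt_200', 'AVG_COST'), ('lt_200', 'AVG_DAYS'),
        ('lt_200_prop', 'AVG_COST'),
    ]),
)

def relabel(top, sub):
    """Return the new level-1 label for one (top, sub) column pair."""
    if top == 'lt_200':
        return sub.lower()
    if top == 'lt_200_prop':
        return 'col'
    return sub

# Iterating over a MultiIndex yields one tuple per column.
summary.columns = pd.MultiIndex.from_tuples(
    [(top, relabel(top, sub)) for top, sub in summary.columns]
)
print(summary.columns.tolist())
```

This keeps each rename next to the column it applies to, at the cost of a few more lines than the union one-liner.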

A simpler solution, using the original get_custom_summary from the question, is to drop the columns:

result = dft.groupby(['year', 'location']).apply(get_custom_summary).unstack(2)
# drop the last 3 columns (duplicates of lt_200_prop), then drop the NaN columns
result = result.drop(result.columns[[-1, -2, -3]], axis=1).dropna(axis=1)
print(result)
                 gt_200                lt_200              lt_200_prop
               AVG_COST    AVG_DAYS  avg_cost    avg_days     AVG_COST
year location
2005 city      5.238095  392.095238  5.500000  144.666667     0.222222
     suburb    4.428571  427.095238  4.000000  167.666667     0.125000
2006 city      4.368421  406.789474  4.571429  150.142857     0.269231
     suburb    4.000000  439.062500  4.142857  145.142857     0.304348
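
Dropping by position depends on the order of the columns. As a hedged alternative, the wanted columns can be selected by label instead. The frame below is built by hand with invented values to mimic the shape of the unstacked result; wide, wanted and tidy are illustrative names.

```python
import numpy as np
import pandas as pd

top = ['gt_200', 'lt_200', 'lt_200_prop']
sub = ['AVG_COST', 'AVG_DAYS', 'avg_cost', 'avg_days']

# One row with the same pattern as the unstacked result: NaN where a
# sub-column does not apply, and the proportion repeated four times.
wide = pd.DataFrame(
    [[5.0, 400.0, np.nan, np.nan,
      np.nan, np.nan, 4.0, 150.0,
      0.25, 0.25, 0.25, 0.25]],
    columns=pd.MultiIndex.from_product([top, sub]),
)

# Name the columns to keep as (level 0, level 1) tuples.
wanted = [
    ('gt_200', 'AVG_COST'), ('gt_200', 'AVG_DAYS'),
    ('lt_200', 'avg_cost'), ('lt_200', 'avg_days'),
    ('lt_200_prop', 'AVG_COST'),
]
tidy = wide[wanted]
print(tidy)
```

Selecting by label also avoids dropna(axis=1), which would remove a legitimate column if any single group had no rows on one side of the 200-day threshold.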
+0

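Although your solution works in this case, it seems like it would get messy if we called get_custom_summary with different nesting over 10 columns. I did use your idea of MultiIndex.from_tuples, but inside the apply function rather than outside it, and so far it seems to work well. I'll post what I did as an answer. – Jay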