2017-01-22 34 views

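Extracting the row with the max value in a DataFrameGroupBy

Newbie here, trying to break my addiction to Excel. I have a dataset of paid invoices with the vendor and country along with the amount paid. For each vendor, I want to know in which country they have the largest invoice amount, and what percentage of their total business that country represents. Using this dataset, the result I want is: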


Desired output

import pandas as pd 
import numpy as np 
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'], 
    'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'], 
    'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9], 
    'Pct' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}) 
CoCntry = df.groupby(['Company', 'Country']) 
CoCntry.aggregate(np.sum) 

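After looking at many examples, including the post "Python : Getting the Row which has the max value in groups using groupby", I have gotten as far as creating a DataFrameGroupBy that summarizes the invoice data by country. I am struggling with how to find the max row. After that, I still have to figure out how to calculate the percentage. Advice welcome.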

Answers


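You can use transform, grouping by the first index level Company, to return a Series Pct holding the sum for each group. Then filter the DataFrame to the row with the max value in each group using idxmax, and finally divide the Amount column by the Series Pct: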

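# CoCntry must be the aggregated DataFrame, not the GroupBy object from the question
CoCntry = df.groupby(['Company', 'Country']).sum()
g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print(Pct)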
Company Country 
bar  one  25 
     three  25 
     two  25 
foo  one  28 
     three  28 
     two  28 
Name: Amount, dtype: int64 

CoCntry = CoCntry.loc[g.idxmax()] 
print (CoCntry) 
       Amount Pct 
Company Country    
bar  one   11 0 
foo  two   11 0 

CoCntry.Pct = CoCntry.Amount.div(Pct) 
print (CoCntry.reset_index()) 
    Company Country Amount  Pct 
0  bar  one  11 0.440000 
1  foo  two  11 0.392857 

Another, similar solution:

CoCntry = df.groupby(['Company', 'Country']).Amount.sum() 
print (CoCntry) 
Company Country 
bar  one  11 
     three  4 
     two  10 
foo  one   9 
     three  8 
     two  11 
Name: Amount, dtype: int64 

g = CoCntry.groupby(level='Company') 
Pct = g.sum() 
print (Pct) 
Company 
bar 25 
foo 28 
Name: Amount, dtype: int64 

maxCoCntry = CoCntry.loc[g.idxmax()].to_frame() 
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0) 
print (maxCoCntry.reset_index()) 

    Company Country Amount  Pct 
0  bar  one  11 0.440000 
1  foo  two  11 0.392857 
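
The `level=0` argument in the division above is what lets a per-company total be broadcast across that company's country rows. A minimal sketch of just that step, using the summed amounts from the printed output typed in by hand (the names `by_country`, `totals` and `share` are illustrative, not from the answer):

```python
import pandas as pd

# Per (Company, Country) totals, copied from the output shown above.
idx = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'three'), ('bar', 'two'),
     ('foo', 'one'), ('foo', 'three'), ('foo', 'two')],
    names=['Company', 'Country'])
by_country = pd.Series([11, 4, 10, 9, 8, 11], index=idx, name='Amount')

# Per Company totals, indexed by Company only.
totals = by_country.groupby(level='Company').sum()

# level=0 matches the index of `totals` against the first index level
# (Company), so each row is divided by its own company's total.
share = by_country.div(totals, level=0)
print(share)
```

Each company's shares should add up to 1, which is a quick way to check that the alignment did what was intended.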

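I don't know why, but the statement "g = CoCntry.groupby(level='Company')['Amount']" triggers the error "AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method" – jones5322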


The second solution works well. Thanks very much. – jones5322


@AlbertJones - I don't know what the problem is, maybe you need to upgrade pandas - in 0.19.2 it works perfectly. – jezrael
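
One possible explanation for the error reported in the comments, offered as an assumption since the thread does not confirm it: in the question, the result of `CoCntry.aggregate(np.sum)` is never assigned, so `CoCntry` is still a `DataFrameGroupBy` object, and that object has no `groupby` method. The sketch below reproduces the situation on a shortened, made-up subset of the data and shows that aggregating first avoids it. The exact error message may differ between pandas versions.

```python
import pandas as pd

# Shortened illustrative data, not the full dataset from the question.
df = pd.DataFrame({'Company': ['bar', 'foo', 'bar', 'foo'],
                   'Country': ['one', 'one', 'two', 'two'],
                   'Amount': [2, 7, 6, 5]})

grouped = df.groupby(['Company', 'Country'])  # a GroupBy object, not a DataFrame

try:
    grouped.groupby(level='Company')
    failed = False
except AttributeError as exc:
    failed = True
    print('AttributeError:', exc)

# Keeping the aggregated result gives a DataFrame with a MultiIndex,
# on which a second groupby by index level works.
summed = grouped.sum()
g = summed.groupby(level='Company')['Amount']
print(g.transform('sum'))
```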


Setup

import pandas as pd

df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
    'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
    'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
    })

Solution

# sum total invoice per country per company 
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum() 

# sum total invoice per company 
comp_totals = df.groupby('Company').Amount.sum() 

# percent of per company per country invoice relative to company 
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct') 

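Answering the OP's question
Which 'Country' per 'Company' has the largest total invoice, and what percentage of the company's total business is it?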

comp_by_country_pct.loc[ 
    comp_by_country_pct.groupby(level=0).idxmax() 
].reset_index() 

    Company Country  Pct 
0  bar  one 0.440000 
1  foo  two 0.392857
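
The result above reports only `Pct`. If the summed `Amount` is wanted next to it, as in the first answer's output, one way to get both is sketched below. This is not part of the original answer, and the names `by_country`, `totals`, `top` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
    'Country': ['two', 'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
    'Amount': [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
})

# Total invoice amount per (Company, Country).
by_country = df.groupby(['Company', 'Country'])['Amount'].sum()

# Company total repeated on every row of that company, so the
# division below is a plain element-wise operation.
totals = by_country.groupby(level='Company').transform('sum')

# Index label (Company, Country) of the largest amount per company.
top = by_country.groupby(level='Company').idxmax().tolist()

result = (pd.DataFrame({'Amount': by_country, 'Pct': by_country / totals})
          .loc[top]
          .reset_index())
print(result)
```

Note that `idxmax` returns only the first maximum in each group, so a company with two countries tied for the largest amount would show just one of them.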