2017-03-02 31 views
1

How do I count subgroups of categorical data in a pandas DataFrame?

I have the following pandas DataFrame:

import pandas as pd 
import numpy as np 
df = pd.DataFrame({"shops": ["shop1", "shop2", "shop3", "shop4", "shop5", "shop6"], "franchise" : ["franchise_A", "franchise_A", "franchise_A", "franchise_A", "franchise_B", "franchise_B"],"items" : ["dog", "cat", "dog", "dog", "bird", "fish"]}) 
df = df[["shops", "franchise", "items"]] 
print(df) 

   shops    franchise items
0  shop1  franchise_A   dog
1  shop2  franchise_A   cat
2  shop3  franchise_A   dog
3  shop4  franchise_A   dog
4  shop5  franchise_B  bird
5  shop6  franchise_B  fish

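So each row is a unique sample (shop1, shop2, etc.), and each sample belongs to a subgroup (franchise_A, franchise_B, franchise_C, etc.). The items column has only four possible categorical values: dog, cat, fish, bird. My goal is to create a bar chart of the number of dog, cat, fish and bird items for each franchise.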

The output I want is:

franchise  dogs cats birds fish 
franchise_A  3  1  0  0 
franchise_B  0  0  1  1 

I believe I first need to use groupby(), e.g.

df.groupby("franchise").count() 
      shops items 
franchise     
franchise_A  4  4 
franchise_B  2  2 

But I don't know how to count the number of each item per franchise.

Answers

3

You can use value_counts with unstack (thanks to Nickil Maveli):

from collections import Counter 

print (df.groupby("franchise")['items'].value_counts().unstack(fill_value=0)) 
items  bird cat dog fish 
franchise       
franchise_A  0 1 3  0 
franchise_B  1 0 0  1 

Other solutions use crosstab or pivot_table:

print (pd.crosstab(df["franchise"], df['items'])) 
items  bird cat dog fish 
franchise       
franchise_A  0 1 3  0 
franchise_B  1 0 0  1 

print (df.pivot_table(index="franchise", columns='items', aggfunc='size', fill_value=0)) 
items  bird cat dog fish 
franchise       
franchise_A  0 1 3  0 
franchise_B  1 0 0  1 
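
All of the results above list the columns alphabetically (bird, cat, dog, fish), while the question asks for dog, cat, bird, fish. A minimal sketch of one way to get that order, assuming it is acceptable to hard-code the category list (the name wanted_order is only illustrative and does not come from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "shops": ["shop1", "shop2", "shop3", "shop4", "shop5", "shop6"],
    "franchise": ["franchise_A"] * 4 + ["franchise_B"] * 2,
    "items": ["dog", "cat", "dog", "dog", "bird", "fish"],
})

# Assumed column order, copied from the desired output in the question.
wanted_order = ["dog", "cat", "bird", "fish"]

# Count items per franchise, then reorder the columns.
# reindex also adds an all-zero column for any listed category
# that never occurs in the data.
counts = pd.crosstab(df["franchise"], df["items"]).reindex(
    columns=wanted_order, fill_value=0
)
print(counts)
```

The same reindex call should work on the value_counts().unstack() and pivot_table results too, since they all return a DataFrame with items as columns.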
+2

Using `value_counts()` instead of `Counter` really tightened the whole thing up. –

+1

@NickilMaveli - Thanks. – jezrael

+0

This is a separate question: suppose there are 5 categories, one of which is `NaN`. How do I treat the NaN values as a separate category? `df.groupby("franchise")['items'].value_counts().unstack(fill_value=0)` doesn't do that. – ShanZhengYang
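
The thread does not answer the NaN question in the comment above. One possible approach, offered only as a sketch, is to replace the missing values with a placeholder label before counting, so they are tallied like any other category. The frame df_nan and the label "missing" are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing items, for illustration only.
df_nan = pd.DataFrame({
    "franchise": ["franchise_A", "franchise_A", "franchise_A",
                  "franchise_B", "franchise_B"],
    "items": ["dog", np.nan, "dog", "bird", np.nan],
})

# Give NaN an explicit label so it is counted as its own category.
labelled = df_nan["items"].fillna("missing")

counts_nan = pd.crosstab(df_nan["franchise"], labelled)
print(counts_nan)
```

Newer pandas versions also accept dropna=False in groupby and value_counts, but filling with a label first does not depend on the pandas version.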

2

You can include the items column in the groupby and then use size:

>>> df.groupby(['franchise', 'items']).size().unstack(fill_value=0) 

items  bird cat dog fish 
franchise       
franchise_A  0 1 3  0 
franchise_B  1 0 0  1 

(Rough) benchmarks

%timeit df.groupby(['franchise', 'items']).size().unstack(fill_value=0) 
100 loops, best of 3: 2.73 ms per loop 

%timeit (df.groupby("franchise")['items'].apply(Counter).unstack(fill_value=0).astype(int)) 
100 loops, best of 3: 4.18 ms per loop 

%timeit df.groupby('franchise')['items'].value_counts().unstack(fill_value=0) 
100 loops, best of 3: 2.71 ms per loop
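
The %timeit lines above only work inside IPython. A rough sketch for repeating the comparison in a plain script with the standard timeit module follows. The candidates dictionary is just an illustrative way to organise the calls, and the timings will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "shops": ["shop1", "shop2", "shop3", "shop4", "shop5", "shop6"],
    "franchise": ["franchise_A"] * 4 + ["franchise_B"] * 2,
    "items": ["dog", "cat", "dog", "dog", "bird", "fish"],
})

# Approaches from the answers above, wrapped as zero-argument callables.
candidates = {
    "groupby + size": lambda: df.groupby(["franchise", "items"]).size().unstack(fill_value=0),
    "value_counts": lambda: df.groupby("franchise")["items"].value_counts().unstack(fill_value=0),
    "crosstab": lambda: pd.crosstab(df["franchise"], df["items"]),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=100, repeat=3))
    timings[name] = best / 100 * 1000
    print(f"{name}: {timings[name]:.2f} ms per loop")
```

On a six-row frame the differences are mostly fixed overhead, so it is worth re-running this on data of realistic size before choosing an approach for speed.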