2015-12-28 29 views
2

I have a dataset of emails and purchases like the one below. How can I use group by and still return the rows with null values?

Email   Purchaser order_id amount 
[email protected] [email protected] 1   5 
[email protected]   
[email protected] [email protected] 2   10 
[email protected] [email protected] 3   5 

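I want to find the total number of people in the dataset, the number of purchasers, the total number of orders and the total revenue. I know how to do this in SQL with a left join and aggregate functions, but I don't know how to replicate it with Python/pandas.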

In Python, I tried this using pandas and numpy:

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'}) 

table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'}) 

The problem is that it only returns the rows that have an order (rows 1 and 3), but not the others (row 2):

Email   Purchaser  order_id amount 
[email protected] [email protected] 1   5 
[email protected] [email protected] 2   15 

The SQL query would look like this:

SELECT count(Email) as num_ind, count(Purchaser) as num_purchasers, sum(order) as orders , sum(amount) as revenue 
    FROM 
     (SELECT Email, Purchaser, count(order_id) as order, sum(amount) as amount 
     FROM table 1 
     GROUP BY Email, Purchaser) x 

How can I replicate this in Python?

+0

Is Purchaser 'NA' or 'NaN'? If so, you can use 'dropna()' to get the result. – WoodChopper

+0

Welcome to StackOverflow - you can read the [tour](http://stackoverflow.com/tour). – jezrael

Answer

2

This is not currently implemented in pandas - see

So one ugly workaround is to replace NaN with some placeholder string and, after agg, replace it back with NaN:

print(table)
     Email Purchaser order_id amount 
0 [email protected] [email protected]   1  5 
1 [email protected]   NaN  NaN  NaN 
2 [email protected] [email protected]   2  10 
3 [email protected] [email protected]   3  5 

table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy') 
print(table)
     Email Purchaser order_id amount 
0 [email protected] [email protected]   1  5 
1 [email protected]  dummy  NaN  NaN 
2 [email protected] [email protected]   2  10 
3 [email protected] [email protected]   3  5 

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'}) 
print(table1)
         order_id amount 
Email  Purchaser      
[email protected] [email protected]   1  5 
[email protected] dummy    0  NaN 
[email protected] [email protected]   2  15 

table1 = table1.reset_index() 
table1['Purchaser'] = table1['Purchaser'].replace('dummy', np.nan) 
print(table1)
     Email Purchaser order_id amount 
0 [email protected] [email protected]   1  5 
1 [email protected]   NaN   0  NaN 
2 [email protected] [email protected]   2  15 
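
As a possible alternative to the placeholder trick: pandas 1.1 and later accept a `dropna=False` argument in `groupby`, which keeps groups whose key contains NaN. The sketch below is an assumption-laden illustration, not part of the original answer. The `example.com` addresses are hypothetical stand-ins for the redacted ones in the question, and the names `per_person` and `summary` are invented here. It also applies the outer aggregation from the SQL query.

```python
import numpy as np
import pandas as pd

# Hypothetical addresses stand in for the redacted ones in the question.
table = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com", "c@example.com"],
    "Purchaser": ["a@example.com", np.nan, "c@example.com", "c@example.com"],
    "order_id": [1, np.nan, 2, 3],
    "amount": [5, np.nan, 10, 5],
})

# dropna=False keeps groups whose key contains NaN (requires pandas >= 1.1).
per_person = (
    table.groupby(["Email", "Purchaser"], dropna=False)
         .agg(orders=("order_id", "count"), amount=("amount", "sum"))
         .reset_index()
)

# Outer aggregation, mirroring the outer SELECT of the SQL query:
# count() skips NaN, so num_purchasers only counts people with a purchase.
summary = {
    "num_ind": int(per_person["Email"].count()),
    "num_purchasers": int(per_person["Purchaser"].count()),
    "orders": int(per_person["orders"].sum()),
    "revenue": float(per_person["amount"].sum()),
}

print(per_person)
print(summary)
```

For the four rows in the question this gives 3 people, 2 purchasers, 3 orders and a revenue of 20. Note that on recent pandas versions the sum of an all-NaN group is 0 rather than NaN, so the non-purchaser row may show an amount of 0 instead of the NaN printed above.
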
+0

Thanks a lot! The solution works perfectly –