How can I perform a groupby on a pandas DataFrame without losing the other columns?

I have a DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'], 
      'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'], 
       'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'], 
     'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'], 
     'person_count': ['10','14','25','20','34','23','43','34','10','20'], 
     'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'], 
     'sir': ['a','a','a','a','b','b','b','b','c','c']}) 
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']] 

print(df)

   sport_name person_name    city person_symbol person_count       month sir
0    football      ramesh  mumbai           ram           10  2017-01-23   a
1    football      ramesh  mumbai           mum           14  2017-01-23   a
2    football      ramesh   delhi           mum           25  2017-01-23   a
3    football      ramesh   delhi           ram           20  2017-01-23   a
4    football      ramesh  mumbai           ram           34  2017-02-26   b
5    football      ramesh  mumbai           mum           23  2017-02-26   b
6    football      ramesh   delhi           mum           43  2017-02-26   b
7    football      ramesh   delhi           ram           34  2017-02-26   b
8  basketball      mahesh    pune           mah           10  2017-03-03   c
9  basketball      mahesh  nagpur           mah           20  2017-03-03   c

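From this DataFrame, I want to create a DataFrame with two new columns named "derived_symbol" and "person_count". To build them, I need to follow these conditions:

derived_symbol has to be formed for each unique city and each unique person_symbol.
person_count is calculated based on what derived_symbol is.

I did the above with the following code, and it works fine: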

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'], 
      'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'], 
       'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'], 
     'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'], 
     'person_count': ['10','14','25','20','34','23','43','34','10','20'], 
     'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'], 
     'sir': ['a','a','a','a','b','b','b','b','c','c']}) 
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']] 

df['person_count'] = df['person_count'].astype(int) 

df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val') 

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count' 

df2 = df1.groupby(['derived_symbol','month','sir','person_name'])['person_count'].sum().reset_index(name='person_count') 
print (df2)

Output of the code above:

                          derived_symbol       month  sir  sport_name  person_name  person_count
0      basketball.mahesh.TOTAL.mah_count  2017-03-03    c  basketball       mahesh            30
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03    c  basketball       mahesh            20
2     basketball.mahesh.TOTAL.pune_count  2017-03-03    c  basketball       mahesh            10
3      football.ramesh.TOTAL.delhi_count  2017-01-23    a    football       ramesh            45
4      football.ramesh.TOTAL.delhi_count  2017-02-26    b    football       ramesh            77
5        football.ramesh.TOTAL.mum_count  2017-01-23    a    football       ramesh            39
6        football.ramesh.TOTAL.mum_count  2017-02-26    b    football       ramesh            66
7     football.ramesh.TOTAL.mumbai_count  2017-01-23    a    football       ramesh            24
8     football.ramesh.TOTAL.mumbai_count  2017-02-26    b    football       ramesh            57
9        football.ramesh.TOTAL.ram_count  2017-01-23    a    football       ramesh            30
10       football.ramesh.TOTAL.ram_count  2017-02-26    b    football       ramesh            68

However, I want the DataFrame to carry two more columns, "city" and "person_symbol", like this:

                          derived_symbol       month  sir  person_name  sport_name  person_count      city  person_symbol
0      basketball.mahesh.TOTAL.mah_count  2017-03-03    c       mahesh  basketball            30  NO_ENTRY            mah
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03    c       mahesh  basketball            20    nagpur       NO_ENTRY
2     basketball.mahesh.TOTAL.pune_count  2017-03-03    c       mahesh  basketball            10      pune       NO_ENTRY
3      football.ramesh.TOTAL.delhi_count  2017-01-23    a       ramesh    football            45     delhi       NO_ENTRY
4      football.ramesh.TOTAL.delhi_count  2017-02-26    b       ramesh    football            77     delhi       NO_ENTRY
5        football.ramesh.TOTAL.mum_count  2017-01-23    a       ramesh    football            39  NO_ENTRY            mum
6        football.ramesh.TOTAL.mum_count  2017-02-26    b       ramesh    football            66  NO_ENTRY            mum
7     football.ramesh.TOTAL.mumbai_count  2017-01-23    a       ramesh    football            24    mumbai       NO_ENTRY
8     football.ramesh.TOTAL.mumbai_count  2017-02-26    b       ramesh    football            57    mumbai       NO_ENTRY
9        football.ramesh.TOTAL.ram_count  2017-01-23    a       ramesh    football            30  NO_ENTRY            ram
10       football.ramesh.TOTAL.ram_count  2017-02-26    b       ramesh    football            68  NO_ENTRY            ram

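The logic behind these two columns is:

If the current row was created for a city, the city column contains the city value and person_symbol contains "NO_ENTRY".
If the current row was created for a particular symbol, the person_symbol column contains the person_symbol value and city contains "NO_ENTRY".

How can I do this kind of data manipulation without losing my previous behavior?

Source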

2017-09-13 kit

You can first add the level_5 and val columns to the groupby:

df2 = df1.groupby(['derived_symbol', 
        'month','sir', 
        'person_name', 
        'level_5', 
        'val'])['person_count'].sum().reset_index(name='person_count') 
print (df2) 
          derived_symbol  month sir person_name \ 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c  mahesh 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c  mahesh 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c  mahesh 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a  ramesh 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b  ramesh 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a  ramesh 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b  ramesh 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a  ramesh 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b  ramesh 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a  ramesh 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b  ramesh 

      level_5  val person_count 
0 person_symbol  mah   30 
1   city nagpur   20 
2   city pune   10 
3   city delhi   45 
4   city delhi   77 
5 person_symbol  mum   39 
6 person_symbol  mum   66 
7   city mumbai   24 
8   city mumbai   57 
9 person_symbol  ram   30 
10 person_symbol  ram   68
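
As a quick sanity check on this step, the extra keys should only carry columns along, not split any group. The sketch below uses a hypothetical three-row miniature of df1 (shortened symbols, fewer key columns) rather than the real data, and compares the number of groups with and without level_5 and val. One assumption worth stating: if a city and a person_symbol ever shared the same string, the wider key list would split that group in two.

```python
import pandas as pd

# Hypothetical miniature of df1 (the stacked frame), not the full data set.
df1 = pd.DataFrame({
    'derived_symbol': ['f.r.TOTAL.delhi_count',
                       'f.r.TOTAL.delhi_count',
                       'f.r.TOTAL.mum_count'],
    'month': ['2017-01-23'] * 3,
    'level_5': ['city', 'city', 'person_symbol'],
    'val': ['delhi', 'delhi', 'mum'],
    'person_count': [25, 20, 14],
})

base_keys = ['derived_symbol', 'month']
wide_keys = base_keys + ['level_5', 'val']

# Same aggregation, once with the original keys and once with the extra ones.
base = df1.groupby(base_keys)['person_count'].sum().reset_index()
wide = df1.groupby(wide_keys)['person_count'].sum().reset_index()

# True when the extra keys did not create additional groups.
same_groups = len(base) == len(wide)
print(same_groups)
print(wide)
```

If same_groups comes out False on real data, some derived_symbol maps to more than one (level_5, val) pair and the sums will differ from the original groupby.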

Then reshape back with unstack, and convert the missing values it produces to NO_ENTRY with fillna.

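df3 = (df2.set_index(['derived_symbol',
                      'month',
                      'sir',
                      'person_name',
                      'person_count',
                      'level_5'])['val']
          .unstack()                   # pivot the last index level (level_5) into columns
          .fillna('NO_ENTRY')          # fill the gaps left by the reshape
          .rename_axis(None, axis=1)   # drop the leftover 'level_5' name on the columns
          .reset_index())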

print (df3) 
          derived_symbol  month sir person_name \ 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c  mahesh 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c  mahesh 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c  mahesh 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a  ramesh 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b  ramesh 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a  ramesh 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b  ramesh 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a  ramesh 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b  ramesh 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a  ramesh 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b  ramesh 

    person_count  city person_symbol 
0    30 NO_ENTRY   mah 
1    20 nagpur  NO_ENTRY 
2    10  pune  NO_ENTRY 
3    45  delhi  NO_ENTRY 
4    77  delhi  NO_ENTRY 
5    39 NO_ENTRY   mum 
6    66 NO_ENTRY   mum 
7    24 mumbai  NO_ENTRY 
8    57 mumbai  NO_ENTRY 
9    30 NO_ENTRY   ram 
10   68 NO_ENTRY   ram
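
For comparison, here is a minimal sketch of a different route to the same two columns that skips the reshape. It is not the approach used above: it assumes a frame shaped like df2 (with level_5 and val columns) and uses Series.where to keep val only on the rows that came from the matching source column. The two-row frame is a hypothetical miniature, not the real data.

```python
import pandas as pd

# Hypothetical miniature of df2: one row per (level_5, val) group.
df2 = pd.DataFrame({
    'derived_symbol': ['football.ramesh.TOTAL.delhi_count',
                       'football.ramesh.TOTAL.mum_count'],
    'month': ['2017-01-23', '2017-01-23'],
    'level_5': ['city', 'person_symbol'],
    'val': ['delhi', 'mum'],
    'person_count': [45, 39],
})

# Keep val where the row originated from this column, otherwise NO_ENTRY.
for col in ['city', 'person_symbol']:
    df2[col] = df2['val'].where(df2['level_5'] == col, 'NO_ENTRY')

# The helper columns are no longer needed once city/person_symbol exist.
result = df2.drop(columns=['level_5', 'val'])
print(result)
```

Because nothing is moved into the index, this variant does not depend on the order of the key columns.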

Source

2017-09-13 06:13:42 jezrael

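@jezrael - I am seeing hundreds of different columns when I run the last unstack() step, i.e. df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'person_count', 'level_5'])['val'].unstack(), on the real data? – kit

Ooops, is there some difference from your sample? – jezrael

@jezrael - I was running the same command with the argument positions changed: df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'level_5', 'person_count'])['val'].unstack(). That was the problem. Solved it. – kit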
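
The behavior described in these comments follows from unstack() pivoting the last index level when no level is given. A small sketch with hypothetical two-row data shows the difference, and shows that naming the level makes the result independent of the key order:

```python
import pandas as pd

# Hypothetical two-row frame with the same column roles as df2.
frame = pd.DataFrame({
    'derived_symbol': ['x.delhi_count', 'x.mum_count'],
    'person_count': [45, 39],
    'level_5': ['city', 'person_symbol'],
    'val': ['delhi', 'mum'],
})

# Here level_5 is NOT the last index level; person_count is.
indexed = frame.set_index(['derived_symbol', 'level_5', 'person_count'])['val']

# Default: the last level (person_count) becomes the columns,
# giving one column per distinct count.
by_default = indexed.unstack()

# Naming the level pivots level_5 regardless of its position.
by_name = indexed.unstack('level_5')

print(list(by_default.columns))
print(list(by_name.columns))
```

On real data with many distinct person_count values, the default call is what yields hundreds of columns.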
