2015-12-03 75 views
1

Pandas: group by and sort by total size

Let's say I have a result like this:

group1 = df.groupby(['first_column', 'second_column']).size()

first_column second_column 
A    A1    1 
       A2    2 
B    B1    1 
       B2    2 
       B3    3 

Then I want to compute the total size per first_column and display it like this:

first_column second_column  
A    A1    1   3 
       A2    2 
B    B1    1   6 
       B2    2 
       B3    3  

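Based on the total size, I want the result sorted and limited to the 10 largest totals. How can I do this? Is it also possible to name the columns, like this: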

first_column second_column size total_size 

Update 1

The dataframe should look like this.

df.head() 

    first_column second_column 
0 A    A1 
1 A    A2 
2 A    A2 
3 B    B1 
4 B    B2 
5 B    B2 
6 B    B3 
7 B    B3 
8 B    B3 
+0

Can you show your df? –

+0

@AntonProtopopov Please see Update 1 – Mrye

Answers

2

The code comments should be self-explanatory.

import pandas as pd

# Sample data.
df = pd.DataFrame({'first_column': ['A']*3 + ['B']*6, 'second_column': ['A1'] + ['A2']*2 + ['B1'] + ['B2']*2 + ['B3']*3})

# Count rows per (first_column, second_column) pair, then reset the index
# and name the count column 'size'.
gb = df.groupby(['first_column', 'second_column']).size().reset_index(name='size')

>>> gb 
    first_column second_column size 
0   A   A1  1 
1   A   A2  2 
2   B   B1  1 
3   B   B2  2 
4   B   B3  3 

# Use transform to sum the `size` by the first column only. 
gb['total_size'] = gb.groupby('first_column')['size'].transform('sum') 

>>> gb 
    first_column second_column size total_size 
0   A   A1  1   3 
1   A   A2  2   3 
2   B   B1  1   6 
3   B   B2  2   6 
4   B   B3  3   6 
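
The question also asks for the top 10 groups by total size. A minimal sketch of one way to do that, building on the frame above: sum `size` per `first_column`, take the `n` largest totals with `nlargest`, and keep only the rows of those groups. The names `n`, `totals`, `top_groups` and `result` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data from the answer above.
df = pd.DataFrame({
    'first_column': ['A'] * 3 + ['B'] * 6,
    'second_column': ['A1'] + ['A2'] * 2 + ['B1'] + ['B2'] * 2 + ['B3'] * 3,
})

gb = df.groupby(['first_column', 'second_column']).size().reset_index(name='size')
gb['total_size'] = gb.groupby('first_column')['size'].transform('sum')

n = 10  # number of first_column groups to keep (assumed from the question)

# One total per first_column value, then the labels of the n largest totals.
totals = gb.groupby('first_column')['size'].sum()
top_groups = totals.nlargest(n).index

# Keep only rows of those groups, largest total first.
result = (
    gb[gb['first_column'].isin(top_groups)]
    .sort_values(['total_size', 'first_column', 'second_column'],
                 ascending=[False, True, True])
    .reset_index(drop=True)
)
print(result)
```

With only two groups in the sample data, both are kept and `B` (total 6) is listed before `A` (total 3). Selecting by `first_column` label rather than by row count means a group is never cut in half.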
0

The input was extended for testing:

first_column second_column 
0    A   A1 
1    A   A2 
2    A   A2 
3    A   A2 
4    A   A2 
5    A   A2 
6    A   A2 
7    A   A2 
8    A   A2 
9    C   A2 
10   C   A2 
11   C   A2 
12   B   B1 
13   H   B1 
14   B   B1 
15   B   B1 
16   C   B1 
17   C   B1 
18   C   B1 
19   D   B1 
20   D   B2 
21   B   B2 
22   B   B3 
23   B   B3 
24   B   B3 
25   E   B3 
26   E   B3 
27   E   B3 
28   B   B3 
29   F   B3 
30   B   B3 
31   G   B3 
32   B   B3 

I use reset_index with the name parameter:

group1 = df.groupby(['first_column', 'second_column']).size().reset_index(name='size') 
group1['total_size'] = group1.groupby('first_column')['size'].transform('sum') 
print(group1) 
    first_column second_column size total_size 
0    A   A1  1   9 
1    A   A2  8   9 
2    B   B1  3   10 
3    B   B2  1   10 
4    B   B3  6   10 
5    C   A2  3   6 
6    C   B1  3   6 
7    D   B1  1   2 
8    D   B2  1   2 
9    E   B3  3   3 
10   F   B3  1   1 
11   G   B3  1   1 
12   H   B1  1   1 
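
If the goal is the nested layout from the question, where the total appears once per `first_column`, one possible approach is to move the label columns into a MultiIndex so that repeated labels are hidden when printing. This is only a sketch on the small frame from Update 1; the name `nested` is made up for the example.

```python
import pandas as pd

# Small frame from "Update 1" in the question.
df = pd.DataFrame({
    'first_column': list('AAABBBBBB'),
    'second_column': ['A1', 'A2', 'A2', 'B1', 'B2', 'B2', 'B3', 'B3', 'B3'],
})

group1 = df.groupby(['first_column', 'second_column']).size().reset_index(name='size')
group1['total_size'] = group1.groupby('first_column')['size'].transform('sum')

# Move the labels and the per-group total into the index. With the default
# display options, pandas prints a repeated index label only once per run.
nested = group1.set_index(['first_column', 'total_size', 'second_column'])['size']
print(nested)
```

The flat `group1` frame is still the easier one to sort and filter; `nested` is meant for display only.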

Sorting:

# get the 5 rows with the largest total_size 
print(group1.nlargest(5, 'total_size').reset_index(drop=True)) 
    first_column second_column size total_size 
0   B   B1  3   10 
1   B   B2  1   10 
2   B   B3  6   10 
3   A   A1  1   9 
4   A   A2  8   9 

# sort df by column total_size (mergesort is stable, so ties keep their original order) 
print(group1.sort_values('total_size', ascending=False, kind='mergesort').reset_index(drop=True)) 
    first_column second_column size total_size 
0    B   B1  3   10 
1    B   B2  1   10 
2    B   B3  6   10 
3    A   A1  1   9 
4    A   A2  8   9 
5    C   A2  3   6 
6    C   B1  3   6 
7    E   B3  3   3 
8    D   B1  1   2 
9    D   B2  1   2 
10   F   B3  1   1 
11   G   B3  1   1 
12   H   B1  1   1 

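Get the rows of the top groups by the total_size column:

# keep the rows whose total_size is among the 5 largest distinct values 
top_sizes = group1['total_size'].drop_duplicates().nlargest(5) 
new_gb = group1[group1['total_size'].isin(top_sizes)] 
# mergesort is stable, so rows with equal total_size keep their original order 
new_gb = new_gb.sort_values('total_size', ascending=False, kind='mergesort').reset_index(drop=True) 
print(new_gb) 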
    first_column second_column size total_size 
0   B   B1  3   10 
1   B   B2  1   10 
2   B   B3  6   10 
3   A   A1  1   9 
4   A   A2  8   9 
5   C   A2  3   6 
6   C   B1  3   6 
7   E   B3  3   3 
8   D   B1  1   2 
9   D   B2  1   2
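
Note that selecting by `total_size` value keeps every `first_column` group that shares one of the top totals, so the result can contain more than five groups. The same selection can be sketched with a dense rank, which gives tied totals the same rank. The data below is hypothetical and already aggregated; `rank` and `top` are names chosen for the example.

```python
import pandas as pd

# Hypothetical pre-aggregated counts (not the data from the answer).
group1 = pd.DataFrame({
    'first_column': ['A', 'A', 'B', 'C', 'D', 'E'],
    'second_column': ['A1', 'A2', 'B1', 'C1', 'D1', 'E1'],
    'size': [1, 8, 10, 3, 3, 1],
})
group1['total_size'] = group1.groupby('first_column')['size'].transform('sum')

# Dense rank: the largest total gets rank 1, ties share a rank, no gaps.
rank = group1['total_size'].rank(method='dense', ascending=False)

# Keep rows whose total is among the 3 largest distinct totals.
top = (
    group1[rank <= 3]
    .sort_values(['total_size', 'first_column'], ascending=[False, True])
    .reset_index(drop=True)
)
print(top)
```

Here `C` and `D` both have a total of 3, so both are kept even though that makes four groups for three distinct totals.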