2016-11-06 42 views
2

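Count and sort with pandas

I have a DataFrame built from a file. I group it by two columns, which returns an aggregated count. Now I want to sort by the count values, largest first, but I get the following error: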

KeyError: 'count'

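It looks like the count column that agg produces for the groups is some kind of index, so I'm not sure how to do this. I'm a beginner with Python and pandas. Here is the actual code; please let me know if you need more detail: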

def answer_five(): 
    df = census_df#.set_index(['STNAME']) 
    df = df[df['SUMLEV'] == 50] 
    df = df[['STNAME','CTYNAME']].groupby(['STNAME']).agg(['count']).sort(['count']) 
    #df.set_index(['count']) 
    print(df.index) 
    # get sorted count max item 
    return df.head(5) 

Answers

10

I think you need to add reset_index and then pass ascending=False to sort_values, because sort returns:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
.sort_values(['count'], ascending=False)

df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .count()
          .reset_index(name='count')
          .sort_values(['count'], ascending=False)
          .head(5))

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'), 
        'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]}) 

print (df) 
    CTYNAME STNAME 
0   4  a 
1   5  b 
2   6  s 
3   5  c 
4   6  s 
5   2  c 
6   3  b 
7   4  c 
8   5  d 
9   6  b 
10  4  c 
11  5  s 
12  4  s 
13  3  c 
14  6  a 
15  5  e 

df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .count()
          .reset_index(name='count')
          .sort_values(['count'], ascending=False)
          .head(5))

print (df) 
    STNAME count 
2  c  5 
5  s  4 
1  b  3 
0  a  2 
3  d  1 
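
As for why the original call raised KeyError: 'count', here is a minimal sketch using the same sample data. It assumes a recent pandas version, where passing a list to agg produces two-level column labels, so the bare label 'count' is not a column on its own. The variable names (agg, top, and so on) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# Passing a list to agg yields MultiIndex columns: (original column, function name)
agg = df.groupby('STNAME').agg(['count'])
columns = list(agg.columns)

# Sorting by the bare label fails because 'count' is only the second level
try:
    agg.sort_values('count')
    bare_label_failed = False
except KeyError:
    bare_label_failed = True

# Sorting by the full column tuple works
top = agg.sort_values(('CTYNAME', 'count'), ascending=False).head(5)
top_states = list(top.index[:3])
```

Selecting the single column first and calling count(), as in the answer above, avoids the two-level labels altogether.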

But it seems you need Series.nlargest:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5) 

Or:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5) 

The difference between size and count is:

size counts NaN values, count does not.
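
A small sketch of that difference, using made-up data with missing values (the frame below is hypothetical and not part of the census data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' has one NaN, group 'b' has two
df = pd.DataFrame({'STNAME': ['a', 'a', 'b', 'b', 'b'],
                   'CTYNAME': [1.0, np.nan, 2.0, np.nan, np.nan]})

grouped = df.groupby('STNAME')['CTYNAME']

sizes = grouped.size()    # number of rows per group, NaN included
counts = grouped.count()  # number of non-NaN values per group

print(sizes)
print(counts)
```

If the counted column has no missing values, both give the same numbers.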

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'), 
        'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]}) 

print (df) 
    CTYNAME STNAME 
0   4  a 
1   5  b 
2   6  s 
3   5  c 
4   6  s 
5   2  c 
6   3  b 
7   4  c 
8   5  d 
9   6  b 
10  4  c 
11  5  s 
12  4  s 
13  3  c 
14  6  a 
15  5  e 

df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .size()
          .nlargest(5)
          .reset_index(name='top5'))
print (df) 
    STNAME top5 
0  c  5 
1  s  4 
2  b  3 
3  a  2 
4  d  1 

Great, thanks for explaining the various options. – Rubans

2

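I don't know exactly what your df looks like. But if you need to sort by the frequency counts of several categories, it's easy to slice a Series out of the df and sort that Series: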

series = df.count().sort_values(ascending=False) 
series.head() 

Note that this Series will use the category names as its index!
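
One possible way to get such a Series, sketched on the sample data from the other answer. This uses Series.value_counts, which may not be exactly what was meant above; it returns the frequencies already sorted in descending order, with the category names as the index.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# Frequency of each category, largest first; the index holds the category names
series = df['STNAME'].value_counts()
top = series.head(3)

print(top)
print(top['c'])  # look up a count by its category label
```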