2017-04-18 53 views
0

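Sort DataFrame rows by one condition and split the resulting blocks into arrays by another condition

I have a pandas.DataFrame with several columns, some holding continuous data and others categorical data. I have been trying to group by category first and then, within each category, split the rows into arrays based on a condition (i.e. a value falling between two numbers).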

Here is a brute-force hack job I wrote that gets the work done, but I would like to know whether there is a more elegant way.

import pandas as pd 

df = pd.DataFrame({'Category1': [0.3, 3.0, 12.4, 7.4,
                                 20.3, 15.0, 10.9, 17.4],
                   'Category2': [0, 0, 1, 0,
                                 1, 1, 0, 0],
                   'Category3': [1, 2, 3, 4,
                                 5, 6, 7, 8],
                   'Category4': ['foo', 'bar', 'fizz', 'buzz',
                                 'spam', 'nii', 'blah', 'lol'],
                   # etc.
                   })

group_0_5 = df['Category1']<=5.0 
group_5_10 = (df['Category1']>5.0) & (df['Category1']<=10.0) 
group_10_15 = (df['Category1']>10.0) & (df['Category1']<=15.0) 
group_15_20 = (df['Category1']>15.0) & (df['Category1']<=20.0)
group_20_25 = (df['Category1']>20.0) & (df['Category1']<=25.0) 

state1 = (df['Category2']==1) 
state2 = (df['Category2']==0) 

count1_state1 = df.loc[group_0_5 & state1]['Category3'].count() 
count2_state1 = df.loc[group_5_10 & state1]['Category3'].count() 
count3_state1 = df.loc[group_10_15 & state1]['Category3'].count() 
count4_state1 = df.loc[group_15_20 & state1]['Category3'].count() 
count5_state1 = df.loc[group_20_25 & state1]['Category3'].count() 

count1_state2 = df.loc[group_0_5 & state2]['Category3'].count() 
count2_state2 = df.loc[group_5_10 & state2]['Category3'].count() 
count3_state2 = df.loc[group_10_15 & state2]['Category3'].count() 
count4_state2 = df.loc[group_15_20 & state2]['Category3'].count() 
count5_state2 = df.loc[group_20_25 & state2]['Category3'].count() 

count_array1=[count1_state1, count2_state1, count3_state1, count4_state1, count5_state1] 

count_array2=[count1_state2, count2_state2, count3_state2, count4_state2, count5_state2] 

print (count_array1) 
print (count_array2) 

Out [2]: 
[nan, nan, 2, 1, 1] 
[ 2, 1, 1, 1, nan] 

Answers

3

I think you need cut to create the bins, then groupby on the bins together with Category2, aggregate with count, and reindex to add the missing values:

import numpy as np

bins = [-np.inf, 5, 10, 15, 20, 25, np.inf]
bins = pd.cut(df['Category1'], bins=bins) 

mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()]) 
a = df.groupby([bins, df['Category2']])['Category3'].count().reindex(mux).unstack(0) 
print (a) 
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0        2.0      1.0       1.0       1.0       NaN
1        NaN      NaN       2.0       NaN       1.0

#select by categories of column Category2 
print (a.loc[0].values) 
[ 2. 1. 1. 1. nan] 

print (a.loc[1].values) 
[ nan nan 2. nan 1.] 
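As a small illustration of the binning step used above, the sketch below shows which bin each Category1 value from the question falls into. The finite edge list and the use of `labels=False` (integer bin positions instead of interval labels) are my own assumptions to keep the output easy to read; the answer itself uses interval labels with infinite outer edges.

```python
import pandas as pd

# Sample values taken from the question's Category1 column.
values = pd.Series([0.3, 3.0, 12.4, 7.4, 20.3, 15.0, 10.9, 17.4])

# Assumption: finite edges; the answer pads the ends with -inf / inf.
edges = [0, 5, 10, 15, 20, 25]

# labels=False returns the integer position of the bin for each value.
bin_index = pd.cut(values, bins=edges, labels=False)

# Bins are closed on the right by default, so 15.0 lands in (10, 15],
# the same bin that the question's mask (>10.0) & (<=15.0) selects.
print(bin_index.tolist())  # [0, 0, 2, 1, 4, 2, 2, 3]
```

Because the default `right=True` matches the `>` / `<=` comparisons in the question, the bins should line up with the hand-written masks.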

If you need to replace NaN with 0, add the parameter fill_value=0 to reindex:

mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()]) 
a = (df.groupby([bins, df['Category2']])['Category3'].count()
       .reindex(mux, fill_value=0)
       .unstack(0))
print (a) 
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0          2        1         1         1         0
1          0        0         2         0         1

print (a.loc[0].values) 
[2 1 1 1 0] 

print (a.loc[1].values) 
[0 0 2 0 1] 
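For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch of the same cut / groupby / count / reindex idea on the question's sample data. It is not the answer's exact code: I am assuming finite bin edges and `labels=False` so that the bins are plain integers, which keeps the reindex target simple, and the names `edges`, `bin_code`, `full` and `table` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category1': [0.3, 3.0, 12.4, 7.4, 20.3, 15.0, 10.9, 17.4],
    'Category2': [0, 0, 1, 0, 1, 1, 0, 0],
    'Category3': [1, 2, 3, 4, 5, 6, 7, 8],
})

# Assumption: finite edges and integer bin codes instead of interval labels.
edges = [0, 5, 10, 15, 20, 25]
bin_code = pd.cut(df['Category1'], bins=edges, labels=False)

# Count rows per (bin, Category2) pair; pairs with no rows are simply absent.
counts = df.groupby([bin_code, df['Category2']])['Category3'].count()

# Build every (bin, state) combination and fill the absent ones with 0.
full = pd.MultiIndex.from_product(
    [range(len(edges) - 1), [0, 1]], names=['bin', 'Category2'])
table = counts.reindex(full, fill_value=0).unstack(0)

count_array1 = table.loc[1].tolist()  # rows where Category2 == 1
count_array2 = table.loc[0].tolist()  # rows where Category2 == 0
print(count_array1)  # [0, 0, 2, 0, 1]
print(count_array2)  # [2, 1, 1, 1, 0]
```

The two lists correspond to count_array1 and count_array2 from the question, with zeros where no row matched.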

Also check "What is the difference between size and count in pandas?"
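A quick sketch of that difference, using a made-up two-column frame (the `key` and `val` names are hypothetical and not from the question): `size` counts every row in a group, while `count` skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'a'.
toy = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'val': [1.0, np.nan, 3.0]})

grouped = toy.groupby('key')['val']

sizes = grouped.size()    # number of rows per group, NaN included
counts = grouped.count()  # number of non-NaN values per group

print(sizes.tolist())   # [2, 1]
print(counts.tolist())  # [1, 1]
```

For the question's data there are no missing values in Category3, so both should give the same numbers.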

+0

Thanks! The cut approach is exactly what I was looking for to refine this code. –

+0

Glad I could help. If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it. Thanks. – jezrael

2

Using pandas.cut() and pandas.DataFrame.groupby you can collect the elements you need:

Code:

groups = df.groupby(pd.cut(df['Category1'], [0, 5, 10, 15, 20, 25])) 

group_size = groups['Category2'].count().values 
group_ones = groups['Category2'].sum().values 

print(list(group_ones)) 
print(list(group_size - group_ones)) 

Results:

[0, 0, 2, 0, 1] 
[2, 1, 1, 1, 0] 
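To spell out why the subtraction works: Category2 only holds 0 and 1, so the sum within a bin is the number of ones, and the remaining rows are the zeros. Below is a self-contained sketch of that idea on the question's data. Passing `observed=False` explicitly is my own addition; I believe the default for categorical groupers differs between pandas versions, and this keeps empty bins in the result.

```python
import pandas as pd

df = pd.DataFrame({
    'Category1': [0.3, 3.0, 12.4, 7.4, 20.3, 15.0, 10.9, 17.4],
    'Category2': [0, 0, 1, 0, 1, 1, 0, 0],
})

edges = [0, 5, 10, 15, 20, 25]

# Group the 0/1 flag by bin; observed=False keeps bins that have no rows.
flags = df.groupby(pd.cut(df['Category1'], edges), observed=False)['Category2']

ones = flags.sum()            # rows with Category2 == 1 in each bin
zeros = flags.count() - ones  # the rest must be Category2 == 0

print(ones.tolist())   # [0, 0, 2, 0, 1]
print(zeros.tolist())  # [2, 1, 1, 1, 0]
```

This shortcut relies on the flag being strictly 0/1; with more than two states, grouping on both columns is the safer route.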
0

Once again, pd.cut with groupby and set_index:

# bins is not defined in this answer; it is assumed to be a list of bin edges
df = df.groupby([pd.cut(df['Category1'], bins=bins, right=True), 'Category2']).Category3.count().reset_index()
df = df.set_index(['Category1', 'Category2']).unstack().reset_index(-1,drop=True) 

count_array1 = df.loc[:, ('Category3', 1)].tolist() 
print(count_array1) 

[nan, nan, 2.0, nan, 1.0] 


count_array2 = df.loc[:, ('Category3', 0)].tolist() 
print(count_array2) 

[2.0, 1.0, 1.0, 1.0, nan] 