2017-07-06

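Reshape groupby in pandas and pad if missing

Given a DataFrame in which each group (grouped by some variable) has a different number of elements, I need to reshape it into a matrix with a predefined number of columns. For example: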

summary_x participant_id_x response_date cuts 
0   3.0    11 2016-05-05 a 
1   3.0    11 2016-05-06 a 
2   4.0    11 2016-05-07 a 
3   4.0    11 2016-05-08 a 
4   3.0    11 2016-05-09 a 
5   3.0    11 2016-05-10 a 
6   3.0    11 2016-05-11 a 
7   3.0    11 2016-05-12 a 
8   3.0    11 2016-05-13 a 
9   3.0    11 2016-05-14 a 
13  4.0    11 2016-05-22 b 
14  4.0    11 2016-05-23 b 
15  3.0    11 2016-05-24 b 
16  3.0    11 2016-05-25 b 
17  3.0    11 2016-05-26 b 
18  3.0    11 2016-05-27 b 
19  3.0    11 2016-05-28 b 
20  3.0    11 2016-06-02 c 
21  3.0    11 2016-06-03 c 
22  3.0    11 2016-06-04 c 
23  3.0    11 2016-06-05 c 
24  3.0    11 2016-06-06 c 
25  3.0    11 2016-06-07 c 
26  3.0    11 2016-06-08 c 
27  3.0    11 2016-06-09 c 
28  3.0    11 2016-06-10 c 
29  5.0    11 2016-06-11 c 
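For anyone who wants to run the snippets on this page, here is one way to rebuild the frame above. The values are copied from the table; the construction itself is only a sketch. The last lines show why a plain reshape cannot work here: 27 values do not divide into rows of 10.

```python
import pandas as pd

# summary_x values copied from the table, group by group
summary = ([3.0, 3.0, 4.0, 4.0] + [3.0] * 6      # group a, 10 rows
           + [4.0, 4.0] + [3.0] * 5              # group b, 7 rows
           + [3.0] * 9 + [5.0])                  # group c, 10 rows

# consecutive days starting at the first date shown for each group
dates = (list(pd.date_range("2016-05-05", periods=10))
         + list(pd.date_range("2016-05-22", periods=7))
         + list(pd.date_range("2016-06-02", periods=10)))

df = pd.DataFrame(
    {
        "summary_x": summary,
        "participant_id_x": 11,
        "response_date": dates,
        "cuts": list("a" * 10 + "b" * 7 + "c" * 10),
    },
    # index labels as in the table: 0..9, then 13..29
    index=list(range(10)) + list(range(13, 30)),
)

# a direct reshape fails because len(df) is not a multiple of 10
try:
    df.summary_x.values.reshape((-1, 10))
except ValueError as exc:
    print("reshape failed:", exc)
```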

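Each group (by 'cuts') contains 10 elements, except group 'b', which contains only 7. I would like a matrix built from 'summary_x', reshaped to (3, 10), with the missing values filled with NaNs: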

pd.DataFrame(df.summary_x.values.reshape((-1,10))) 

     0 1 2 3 4 5 6 7 8 9 
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 
1 nan nan nan 4.0 4.0 3.0 3.0 3.0 3.0 3.0 
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 

Any solutions?

Answers


You can use cumcount together with pivot, and [::-1] to reverse the order of the columns:

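import numpy as np

g = df.groupby('cuts').cumcount(ascending=False)
# counter stored as a column so DataFrame.pivot can be used
# (pd.pivot in newer pandas requires a DataFrame as its first argument)
df = (df.assign(g=g)
        .pivot(index='cuts', columns='g', values='summary_x')
        .iloc[:, ::-1]
        .reset_index(drop=True))
df.columns = np.arange(len(df.columns))
print(df)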
    0 1 2 3 4 5 6 7 8 9 
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 
1 NaN NaN NaN 4.0 4.0 3.0 3.0 3.0 3.0 3.0 
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 
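To see why this pads the short group on the left, it can help to look at the counter itself. A small sketch on a made-up frame (the toy values are assumptions for illustration, not the question's data):

```python
import pandas as pd

# toy frame (hypothetical data): group 'a' has 3 rows, group 'b' has 2
toy = pd.DataFrame({"cuts": list("aaabb"),
                    "summary_x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# descending counter: the last row of every group gets 0
toy["g"] = toy.groupby("cuts").cumcount(ascending=False)
print(toy["g"].tolist())   # [2, 1, 0, 1, 0]

# pivot aligns rows on the counter, so the short group has no entry for
# the highest label; reversing the columns moves that gap to the left
wide = toy.pivot(index="cuts", columns="g", values="summary_x").iloc[:, ::-1]
print(wide.to_numpy())
```

Group 'b' never receives the label 2, so after the column reversal its NaN sits in the first position and its values keep their original order.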

Another solution:

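# reverse the rows so each group's list comes out back to front
L = df[::-1].groupby('cuts')['summary_x'].apply(list).values.tolist()
# DataFrame pads short lists with NaN on the right;
# flipping the columns restores the order and moves the NaNs to the left
df = pd.DataFrame(L).iloc[:, ::-1]
df.columns = np.arange(len(df.columns))
print(df)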
    0 1 2 3 4 5 6 7 8 9 
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 
1 NaN NaN NaN 4.0 4.0 3.0 3.0 3.0 3.0 3.0 
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 

But if the NaNs can also go at the end of the row:

g = df.groupby('cuts').cumcount()
# counter stored as a column so DataFrame.pivot can be used
df = (df.assign(g=g)
        .pivot(index='cuts', columns='g', values='summary_x')
        .rename_axis(None, axis=1)
        .reset_index(drop=True))

print(df)
    0 1 2 3 4 5 6 7 8 9 
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 
1 4.0 4.0 3.0 3.0 3.0 3.0 3.0 NaN NaN NaN 
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0
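The question asks for a predefined number of columns, while the snippets above produce as many columns as the largest group has rows. One possible way to pin the width is to reindex the pivoted columns. The helper below is a hypothetical sketch: the name `reshape_padded` and its parameters are made up here and are not part of pandas.

```python
import pandas as pd

def reshape_padded(df, key, value, width, pad="right"):
    """Hypothetical helper: one row per group, exactly `width` columns,
    NaN-padded on the chosen side."""
    # ascending counter pads on the right, descending counter on the left
    counter = df.groupby(key).cumcount(ascending=(pad == "right"))
    if counter.max() >= width:
        raise ValueError("a group has more than `width` elements")
    wide = (df.assign(_pos=counter)
              .pivot(index=key, columns="_pos", values=value)
              .reindex(columns=range(width)))   # force the predefined width
    if pad == "left":
        wide = wide.iloc[:, ::-1]               # move the NaNs to the front
    wide.columns = range(width)
    return wide.reset_index(drop=True)

# toy frame (hypothetical data): group 'a' has 3 rows, group 'b' has 2
toy = pd.DataFrame({"cuts": list("aaabb"),
                    "summary_x": [1.0, 2.0, 3.0, 4.0, 5.0]})
print(reshape_padded(toy, "cuts", "summary_x", width=4, pad="left"))
print(reshape_padded(toy, "cuts", "summary_x", width=4, pad="right"))
```

With the question's frame this would be called as `reshape_padded(df, 'cuts', 'summary_x', width=10, pad='left')`. Rows come out ordered by the sorted group keys, since that is how pivot arranges its index.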