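Reshape a pandas DataFrame using a multi-column key

I have a pandas DataFrame in which two columns (eng_id, date) together form the unique key. I need to convert it to the shape shown below, creating one column per unique equipment_id value and filling it with the corresponding measurement. How can I do this?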

From:

   eng_id     date  equipment_id  measurement
        1  2016-01           100           20
        1  2016-01           200           46
        1  2016-01           300           18
        1  2016-04           200           33
        1  2016-05           200           27
        2  2016-01           300            9
        2  2016-01           400           15
        2  2016-05           400           65
        2  2016-05           500           51
        2  2016-05           600           16

To:

       ID  100  200  300  400  500  600
1,2016-01   20   46   18    0    0    0
1,2016-04    0   33    0    0    0    0
1,2016-05    0   27    0    0    0    0
2,2016-01    0    0    9   15    0    0
2,2016-05    0    0    0   65   51   16

Source

2017-05-29 sina

Concatenate the two columns into a new ID column, then use pivot:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.pivot(index='ID', columns='equipment_id', values='measurement').fillna(0).astype(int)
print(df)
equipment_id  100  200  300  400  500  600
ID
1,2016-01      20   46   18    0    0    0
1,2016-04       0   33    0    0    0    0
1,2016-05       0   27    0    0    0    0
2,2016-01       0    0    9   15    0    0
2,2016-05       0    0    0   65   51   16
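
For anyone who wants to run the snippets in this answer, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the pivot approach. The names `key` and `wide` are mine, and the sketch assumes `date` is stored as strings, as it appears in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# Build the combined key; eng_id is numeric, so cast it to str first.
key = df['eng_id'].astype(str) + ',' + df['date']

# pivot leaves NaN where a key/equipment pair has no row,
# so fill with 0 and cast back to integers.
wide = (df.assign(ID=key)
          .pivot(index='ID', columns='equipment_id', values='measurement')
          .fillna(0)
          .astype(int))
print(wide)
```

Using assign here keeps the original `df` unchanged, which makes it easier to try the other variants on the same frame.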

A similar solution with set_index + unstack:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.set_index(['ID', 'equipment_id'])['measurement'].unstack(fill_value=0)
print(df)
equipment_id  100  200  300  400  500  600
ID
1,2016-01      20   46   18    0    0    0
1,2016-04       0   33    0    0    0    0
1,2016-05       0   27    0    0    0    0
2,2016-01       0    0    9   15    0    0
2,2016-05       0    0    0   65   51   16

But if you need the ID kept as two separate columns (here as a two-level index):

df = df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0)
print(df)
equipment_id    100  200  300  400  500  600
eng_id date
1      2016-01   20   46   18    0    0    0
       2016-04    0   33    0    0    0    0
       2016-05    0   27    0    0    0    0
2      2016-01    0    0    9   15    0    0
       2016-05    0    0    0   65   51   16

To turn the index levels back into ordinary columns, add reset_index + rename_axis:

# The chained calls are wrapped in parentheses so they can span several lines.
df = (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
   eng_id     date  100  200  300  400  500  600
0       1  2016-01   20   46   18    0    0    0
1       1  2016-04    0   33    0    0    0    0
2       1  2016-05    0   27    0    0    0    0
3       2  2016-01    0    0    9   15    0    0
4       2  2016-05    0    0    0   65   51   16
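
As a possible shortcut, newer pandas releases (1.1 and later, as far as I know) let pivot take a list for index, so neither the helper ID column nor the set_index call is needed. A minimal sketch under that assumption, on a cut-down copy of the sample data; the name `out` is illustrative:

```python
import pandas as pd

# Cut-down version of the question's data, with unique keys.
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 2],
    'date': ['2016-01', '2016-01', '2016-04', '2016-01'],
    'equipment_id': [100, 200, 200, 300],
    'measurement': [20, 46, 33, 9],
})

out = (df.pivot(index=['eng_id', 'date'],   # list index: assumes pandas >= 1.1
                columns='equipment_id',
                values='measurement')
         .fillna(0)
         .astype(int)                       # cast before reset_index so 'date' is untouched
         .reset_index()
         .rename_axis(None, axis=1))        # drop the 'equipment_id' label on the columns
print(out)
```

Like plain pivot, this still raises an error when the key combination is not unique.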

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

it means the key contains duplicates, and you need pivot_table with an aggregate function such as mean or sum:

print(df)
    eng_id     date  equipment_id  measurement
0        1  2016-01           100           20  <- duplicate 1 2016-01 100
1        1  2016-01           100           30  <- duplicate 1 2016-01 100
2        1  2016-01           200           46
3        1  2016-01           300           18
4        1  2016-04           200           33
5        1  2016-05           200           27
6        2  2016-01           300            9
7        2  2016-01           400           15
8        2  2016-05           400           65
9        2  2016-05           500           51
10       2  2016-05           600           16

df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.pivot_table(index='ID',
                    columns='equipment_id',
                    values='measurement',
                    fill_value=0,
                    aggfunc='mean')
print(df)
equipment_id  100  200  300  400  500  600
ID
1,2016-01      25   46   18    0    0    0  <= (20+30)/2=25
1,2016-04       0   33    0    0    0    0
1,2016-05       0   27    0    0    0    0
2,2016-01       0    0    9   15    0    0
2,2016-05       0    0    0   65   51   16

Or use groupby + an aggregate function + unstack:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.groupby(['ID', 'equipment_id'])['measurement'].mean().unstack(fill_value=0)
print(df)
equipment_id  100  200  300  400  500  600
ID
1,2016-01      25   46   18    0    0    0  <= (20+30)/2=25
1,2016-04       0   33    0    0    0    0
1,2016-05       0   27    0    0    0    0
2,2016-01       0    0    9   15    0    0
2,2016-05       0    0    0   65   51   16
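
Before picking an aggregate function it can help to look at which rows actually collide. A small sketch using DataFrame.duplicated with keep=False, which flags every member of a duplicated group; the names `key_cols`, `dupes` and `wide` are illustrative, and the data is a cut-down version of the frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'eng_id': [1, 1, 1, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-01'],
    'equipment_id': [100, 100, 200, 300],
    'measurement': [20, 30, 46, 9],
})

key_cols = ['eng_id', 'date', 'equipment_id']

# keep=False marks all rows of each duplicated key, not just the later ones.
dupes = df[df.duplicated(subset=key_cols, keep=False)]
print(dupes)

# Aggregating on the full key removes the collisions before reshaping.
wide = (df.groupby(key_cols)['measurement']
          .mean()
          .unstack(fill_value=0))
print(wide)
```

Note that mean yields floating-point values, so the result may print as 25.0 rather than 25 depending on how it is produced.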

Source

2017-05-29 13:53:40 jezrael

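@jezrael's answer covers the most idiomatic ways to do this. This was just an exploratory exercise for me, and I wanted to share what I found. The technique assumes that the combination ['eng_id', 'date', 'equipment_id'] is unique.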

z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist()))
# i will be the row positions I will use to insert into the values array
# u will be the tuples that make up the index
i, u = pd.Series(z).factorize()
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])
# j will be the column positions I will use to insert into the values array
# col will be the column labels
j, col = df.equipment_id.factorize()

# Create a place holder dataframe 
d = pd.DataFrame(0, idx, col) 

# fill the values 
d.values[i, j] = df.measurement.values 

print(d)

                100  200  300  400  500  600
eng_id date
1      2016-01   20   46   18    0    0    0
       2016-04    0   33    0    0    0    0
       2016-05    0   27    0    0    0    0
2      2016-01    0    0    9   15    0    0
       2016-05    0    0    0   65   51   16
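
Writing through d.values relies on that attribute being a writable view of the frame's data, which may not hold in every pandas version or configuration. A variant of the same idea, sketched here with a `values` array of my own naming, fills a plain NumPy array first and wraps it in a DataFrame afterwards. The factorize steps are the same as above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# Row positions (i) and the unique (eng_id, date) tuples (u).
z = list(zip(df['eng_id'].tolist(), df['date'].tolist()))
i, u = pd.Series(z).factorize()
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])

# Column positions (j) and the unique equipment_id labels (col).
j, col = df['equipment_id'].factorize()

# Fill a zero array directly, then build the frame from it.
values = np.zeros((len(idx), len(col)), dtype=df['measurement'].dtype)
values[i, j] = df['measurement'].to_numpy()

d = pd.DataFrame(values, index=idx, columns=col)
print(d)
```

As with the original, this assumes unique keys: if a key combination repeats, the last measurement written silently wins.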

Timing
Small data
This might look different with larger data; I didn't test it.

%%timeit 
z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist())) 
i, u = pd.Series(z).factorize() 
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date']) 
j, col = df.equipment_id.factorize() 
 
d = pd.DataFrame(0, idx, col) 
 
d.values[i, j] = df.measurement.values 
1000 loops, best of 3: 885 µs per loop 

%timeit df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0) 
100 loops, best of 3: 1.96 ms per loop
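
Since the timings above cover only the small sample, here is a rough sketch of how the comparison could be repeated on larger synthetic data. Everything in it is an assumption of mine: the helper names (make_frame, via_unstack, via_factorize), the frame sizes, and the use of the standard timeit module instead of the notebook magic. The factorize variant also fills a NumPy array rather than writing through d.values. No results are claimed; the numbers will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_eng=50, n_equip=40, frac=0.5, seed=0):
    """Synthetic long-format frame with unique (eng_id, date, equipment_id) keys."""
    dates = [f'2016-{m:02d}' for m in range(1, 13)]
    full = pd.MultiIndex.from_product(
        [range(1, n_eng + 1), dates, range(100, 100 * (n_equip + 1), 100)],
        names=['eng_id', 'date', 'equipment_id'])
    out = full.to_frame(index=False)
    out['measurement'] = np.arange(1, len(out) + 1)
    # Keep a random subset so that some cells have to be filled with 0.
    return out.sample(frac=frac, random_state=seed).reset_index(drop=True)


def via_unstack(df):
    return (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
              .unstack(fill_value=0))


def via_factorize(df):
    z = list(zip(df['eng_id'].tolist(), df['date'].tolist()))
    i, u = pd.Series(z).factorize()
    idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])
    j, col = df['equipment_id'].factorize()
    values = np.zeros((len(idx), len(col)), dtype=df['measurement'].dtype)
    values[i, j] = df['measurement'].to_numpy()
    return pd.DataFrame(values, index=idx, columns=col)


big = make_frame()
t_unstack = timeit.timeit(lambda: via_unstack(big), number=5)
t_factorize = timeit.timeit(lambda: via_factorize(big), number=5)
print(f'unstack:   {t_unstack / 5:.4f} s per run')
print(f'factorize: {t_factorize / 5:.4f} s per run')
```

The two results differ in ordering: unstack sorts rows and columns, while the factorize version keeps order of first appearance, so sort both axes before comparing them.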

Source

2017-05-29 14:34:39 piRSquared
