2017-05-29 94 views
1

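Reshaping a pandas DataFrame with a multi-column key

I have a pandas DataFrame in which two columns (eng_id, date) together form a unique key. I need to reshape it into the form shown below, creating one column per unique equipment_id value and filling it with the corresponding measurement. How can I do this?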

From: 
    eng_id  date  equipment_id  measurement 
     1  2016-01  100     20 
     1  2016-01  200     46 
     1  2016-01  300     18 
     1  2016-04  200     33 
     1  2016-05  200     27 
     2  2016-01  300     9 
     2  2016-01  400     15 
     2  2016-05  400     65 
     2  2016-05  500     51 
     2  2016-05  600     16 

To: 

    ID   100  200  300  400  500  600 
1,2016-01  20  46  18  0  0  0 
1,2016-04  0  33  0   0  0  0 
1,2016-05  0  27  0   0  0  0 
2,2016-01  0   0  9   15  0  0 
2,2016-05  0   0  0   65  51  16 

Answers

2

Concatenate the two columns into a new ID column, then use pivot:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date'] 
df = df.pivot(index='ID', columns='equipment_id', values='measurement').fillna(0).astype(int) 
print (df) 
equipment_id 100 200 300 400 500 600 
ID           
1,2016-01  20 46 18 0 0 0 
1,2016-04  0 33 0 0 0 0 
1,2016-05  0 27 0 0 0 0 
2,2016-01  0 0 9 15 0 0 
2,2016-05  0 0 0 65 51 16 
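
The snippets in this answer assume a `df` that already holds the question's data, and several of them overwrite `df` with the result. For anyone who wants to run them, here is a minimal sketch that builds the sample frame and applies the pivot step. The name `wide` and the use of `assign` are my own choices for illustration, so the original `df` stays intact.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# Build the combined key without modifying df, then pivot.
# Missing (ID, equipment_id) pairs become NaN, so fill with 0 and cast back to int.
wide = (
    df.assign(ID=df['eng_id'].astype(str) + ',' + df['date'])
      .pivot(index='ID', columns='equipment_id', values='measurement')
      .fillna(0)
      .astype(int)
)
print(wide)
```

Note that the string concatenation only works as written because `date` is stored as text here; a datetime column would need converting first.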

A similar solution with set_index + unstack:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date'] 
df = df.set_index(['ID', 'equipment_id'])['measurement'].unstack(fill_value=0) 
print (df) 
equipment_id 100 200 300 400 500 600 
ID           
1,2016-01  20 46 18 0 0 0 
1,2016-04  0 33 0 0 0 0 
1,2016-05  0 27 0 0 0 0 
2,2016-01  0 0 9 15 0 0 
2,2016-05  0 0 0 65 51 16 

But if you need to keep both key columns as a two-level index:

df = df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0) 
print (df) 
equipment_id 100 200 300 400 500 600 
eng_id date         
1  2016-01 20 46 18 0 0 0 
     2016-04 0 33 0 0 0 0 
     2016-05 0 27 0 0 0 0 
2  2016-01 0 0 9 15 0 0 
     2016-05 0 0 0 65 51 16 

To turn the index levels back into ordinary columns, add reset_index + rename_axis:

# Parentheses are required so the method chain can span several lines
df = (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1))
print (df) 
    eng_id  date 100 200 300 400 500 600 
0  1 2016-01 20 46 18 0 0 0 
1  1 2016-04 0 33 0 0 0 0 
2  1 2016-05 0 27 0 0 0 0 
3  2 2016-01 0 0 9 15 0 0 
4  2 2016-05 0 0 0 65 51 16 
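
As far as I know, pandas 1.1 and later also let `pivot` take a list of columns for `index`, which would make the concatenated ID column unnecessary. The sketch below assumes such a version and the question's sample data; the names `wide` and `flat` are only for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# Assumption: pandas >= 1.1, where `index` may be a list of column names
wide = (
    df.pivot(index=['eng_id', 'date'], columns='equipment_id', values='measurement')
      .fillna(0)
      .astype(int)
)

# Move the two index levels back into columns and drop the 'equipment_id' axis label
flat = wide.reset_index().rename_axis(None, axis=1)
print(flat)
```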

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

it means the key combination is duplicated, and you need pivot_table with an aggregate function such as mean or sum:

print (df) 
    eng_id  date equipment_id measurement 
0  1 2016-01   100   20 <-duplicate 1 2016-01 100 
1  1 2016-01   100   30 <-duplicate 1 2016-01 100 
2  1 2016-01   200   46 
3  1 2016-01   300   18 
4  1 2016-04   200   33 
5  1 2016-05   200   27 
6  2 2016-01   300   9 
7  2 2016-01   400   15 
8  2 2016-05   400   65 
9  2 2016-05   500   51 
10  2 2016-05   600   16 

df['ID'] = df['eng_id'].astype(str) + ',' + df['date'] 
df = df.pivot_table(index='ID', 
        columns='equipment_id', 
        values='measurement', 
        fill_value=0, 
        aggfunc='mean') 
print (df) 
equipment_id 100 200 300 400 500 600 
ID           
1,2016-01  25 46 18 0 0 0 <= (20+30)/2=25 
1,2016-04  0 33 0 0 0 0 
1,2016-05  0 27 0 0 0 0 
2,2016-01  0 0 9 15 0 0 
2,2016-05  0 0 0 65 51 16 

Or use groupby + an aggregate function + unstack:

df['ID'] = df['eng_id'].astype(str) + ',' + df['date'] 
df = df.groupby(['ID', 'equipment_id'])['measurement'].mean().unstack(fill_value=0) 
print (df) 
equipment_id 100 200 300 400 500 600 
ID           
1,2016-01  25 46 18 0 0 0 <= (20+30)/2=25 
1,2016-04  0 33 0 0 0 0 
1,2016-05  0 27 0 0 0 0 
2,2016-01  0 0 9 15 0 0 
2,2016-05  0 0 0 65 51 16 
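
Before picking an aggregate, it can help to look at the rows that actually collide. The following is a rough sketch using `DataFrame.duplicated` on the three key columns, with the duplicated sample data from above. The variable names `key`, `dupes`, and `wide` are illustrative, and `sum` is used here only to contrast with the `mean` examples.

```python
import pandas as pd

# Sample data with one duplicated (eng_id, date, equipment_id) combination
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 30, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

key = ['eng_id', 'date', 'equipment_id']

# keep=False marks every member of a duplicated group, not just the later ones
dupes = df[df.duplicated(subset=key, keep=False)]
print(dupes)

# Aggregate the collisions while reshaping; here the duplicates are summed
wide = df.pivot_table(index=['eng_id', 'date'],
                      columns='equipment_id',
                      values='measurement',
                      aggfunc='sum',
                      fill_value=0)
print(wide)
```

Whether mean, sum, first, or something else is appropriate depends on what a repeated measurement means in your data.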
0

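@jezrael's answer covers the most idiomatic ways to do this. This was just an exploratory exercise for me, and I wanted to share what I found. The technique assumes that the combination of ['eng_id', 'date', 'equipment_id'] is unique.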

z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist())) 
# i holds the row position of each record in the values array 
# u holds the unique (eng_id, date) tuples that make up the index 
i, u = pd.Series(z).factorize() 
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date']) 
# j holds the column position of each record in the values array 
# col holds the column labels 
j, col = df.equipment_id.factorize() 

# Create a place holder dataframe 
d = pd.DataFrame(0, idx, col) 

# fill the values 
d.values[i, j] = df.measurement.values 

print(d) 

      100 200 300 400 500 600 
eng_id date         
1  2016-01 20 46 18 0 0 0 
     2016-04 0 33 0 0 0 0 
     2016-05 0 27 0 0 0 0 
2  2016-01 0 0 9 15 0 0 
     2016-05 0 0 0 65 51 16 
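
One caveat: writing through `d.values` relies on it being a writable view of the frame's data, which I don't believe is guaranteed across pandas versions (for example when copy-on-write is enabled). A sketch of the same idea that fills a plain NumPy array first and wraps it afterwards is shown below. Note that it is not the code above: it swaps the tuple factorization for `groupby(...).ngroup()` plus `drop_duplicates()`, and the names `keys` and `values` are my own. Like the original, it assumes each key combination is unique.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

keys = ['eng_id', 'date']

# Row position of each record; with sort=False groups are numbered in order of first appearance
i = df.groupby(keys, sort=False).ngroup().to_numpy()
# Unique (eng_id, date) pairs in the same first-appearance order
idx = pd.MultiIndex.from_frame(df[keys].drop_duplicates())
# Column position of each record, plus the column labels
j, col = pd.factorize(df['equipment_id'])

# Fill a standalone array, then wrap it; no write goes through DataFrame.values
values = np.zeros((len(idx), len(col)), dtype=df['measurement'].dtype)
values[i, j] = df['measurement'].to_numpy()
d = pd.DataFrame(values, index=idx, columns=col)
print(d)
```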

Timing
Small data
The picture may be different for large data; I did not test that.

%%timeit 
z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist())) 
i, u = pd.Series(z).factorize() 
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date']) 
j, col = df.equipment_id.factorize() 
d = pd.DataFrame(0, idx, col) 
d.values[i, j] = df.measurement.values 
1000 loops, best of 3: 885 µs per loop 

%timeit df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0) 
100 loops, best of 3: 1.96 ms per loop
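
For anyone who wants to check the large-data case, here is one way the comparison could be set up outside IPython with the standard `timeit` module. Everything about the synthetic data (its size, the 50% sampling, the measurement values) is an arbitrary assumption for illustration, and `with_array_fill` is an array-based variant of the technique rather than the exact code timed above. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

keys = ['eng_id', 'date']


def with_unstack(frame):
    # The set_index + unstack approach from the first answer
    return frame.set_index(keys + ['equipment_id'])['measurement'].unstack(fill_value=0)


def with_array_fill(frame):
    # Array-based variant: compute row/column positions, fill a NumPy array, wrap it
    i = frame.groupby(keys, sort=False).ngroup().to_numpy()
    idx = pd.MultiIndex.from_frame(frame[keys].drop_duplicates())
    j, col = pd.factorize(frame['equipment_id'])
    values = np.zeros((len(idx), len(col)), dtype=frame['measurement'].dtype)
    values[i, j] = frame['measurement'].to_numpy()
    return pd.DataFrame(values, index=idx, columns=col)


# Hypothetical synthetic data: sample half of all key combinations so each one is unique
dates = [f'2016-{m:02d}' for m in range(1, 13)]
full = pd.MultiIndex.from_product(
    [range(200), dates, range(100, 2100, 100)],
    names=keys + ['equipment_id'],
).to_frame(index=False)
big = full.sample(frac=0.5, random_state=0).reset_index(drop=True)
big['measurement'] = np.arange(1, len(big) + 1)

for fn in (with_unstack, with_array_fill):
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```

The two functions order rows and columns differently (sorted versus first appearance), so sort both axes before comparing their results.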