2017-07-31 31 views
2

pandas: convert a categorical column to binary-encoded form

I have a dataset that looks like this:

 yyyy  month  tmax   tmin 
0 1908 January   5.0   -1.4 
1 1908 February   7.3   1.9 
2 1908  March   6.2   0.3 
3 1908  April   7.4   2.1 
4 1908  May  16.5   7.7 
5 1908  June  17.7   8.7 
6 1908  July  20.1   11.0 
7 1908  August  17.5   9.7 
8 1908 September  16.3   8.4 
9 1908 October  14.6   8.0 
10 1908 November   9.6   3.4 
11 1908 December   5.8   -0.3 
12 1909 January   5.0   0.1 
13 1909 February   5.5   -0.3 
14 1909  March   5.6   -0.3 
15 1909  April  12.2   3.3 
16 1909  May  14.7   4.8 
17 1909  June  15.0   7.5 
18 1909  July  17.3   10.8 
19 1909  August  18.8   10.7 
20 1909 September  14.5   8.1 
21 1909 October  12.9   6.9 
22 1909 November   7.5   1.7 
23 1909 December   5.3   0.4 
24 1910 January   5.2   -0.5 
... 

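It has four variables: yyyy, month, tmax (maximum temperature), and tmin.

I want to use the month column as a variable when predicting, so I would like to convert it to its binary-encoded version. Essentially, I want to add 12 columns to the dataset, named January through December. If a row's month is January, the January column should be 1 and the other 11 new columns should be 0.

I looked at pivot tables, but they did not help here. Any ideas on how to do this in a simple, elegant way?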

Answers

5

I think you need get_dummies:

# assign to a new name so the original df (and its month column) stays available below
df1 = pd.get_dummies(df['month'])

To add the new columns to the original DataFrame and remove month at the same time, use join together with pop:

df2 = df.join(pd.get_dummies(df.pop('month'))) 
print (df2.head()) 
    yyyy tmax tmin April August December February January July June \ 
0 1908 5.0 -1.4  0  0   0   0  1  0  0 
1 1908 7.3 1.9  0  0   0   1  0  0  0 
2 1908 6.2 0.3  0  0   0   0  0  0  0 
3 1908 7.4 2.1  1  0   0   0  0  0  0 
4 1908 16.5 7.7  0  0   0   0  0  0  0 

    March May November October September 
0  0 0   0  0   0 
1  0 0   0  0   0 
2  1 0   0  0   0 
3  0 0   0  0   0 
4  0 1   0  0   0 
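
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the first five rows of the question's data by hand. The `sample` frame and the name `encoded` are my own stand-ins, not the asker's full dataset. It passes `dtype=int` because pandas 2.0 and later return boolean columns by default, whereas the output above shows 0/1.

```python
import pandas as pd

# Hand-built stand-in for the first five rows of the question's data
sample = pd.DataFrame({
    'yyyy': [1908] * 5,
    'month': ['January', 'February', 'March', 'April', 'May'],
    'tmax': [5.0, 7.3, 6.2, 7.4, 16.5],
    'tmin': [-1.4, 1.9, 0.3, 2.1, 7.7],
})

# pop() removes 'month' from sample and returns it; join() then aligns the
# indicator columns back onto the remaining columns by index
encoded = sample.join(pd.get_dummies(sample.pop('month'), dtype=int))
print(encoded)
```

Note that pop mutates `sample` in place, so the month column is gone from the original frame afterwards as well.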

If you do not need to remove the month column:

df2 = df.join(pd.get_dummies(df['month'])) 
print (df2.head()) 
    yyyy  month tmax tmin April August December February January \ 
0 1908 January 5.0 -1.4  0  0   0   0  1 
1 1908 February 7.3 1.9  0  0   0   1  0 
2 1908  March 6.2 0.3  0  0   0   0  0 
3 1908  April 7.4 2.1  1  0   0   0  0 
4 1908  May 16.5 7.7  0  0   0   0  0 

    July June March May November October September 
0  0  0  0 0   0  0   0 
1  0  0  0 0   0  0   0 
2  0  0  1 0   0  0   0 
3  0  0  0 0   0  0   0 
4  0  0  0 1   0  0   0 

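If you need the columns in calendar order, there are several possible solutions. Use reindex_axis or reindex (reindex_axis only exists in older pandas: it was deprecated in 0.21 and removed in 1.0, so prefer reindex on current versions):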

months = ['January', 'February', 'March','April' ,'May', 'June', 'July', 'August', 'September','October', 'November','December'] 
df1 = pd.get_dummies(df['month']).reindex_axis(months, 1) 
print (df1.head()) 
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 

df1 = pd.get_dummies(df['month']).reindex(columns=months) 
print (df1.head()) 
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 
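
One thing worth knowing about the reindex route: if some months never occur in the data, reindex still creates their columns, but fills them with NaN unless told otherwise. The sketch below uses a made-up `partial` series (an assumption for illustration, not the question's data) and passes `fill_value=0` so the absent months come out as all-zero columns.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Hypothetical partial data: only three distinct months are present
partial = pd.Series(['March', 'January', 'March', 'July'], name='month')

# Without fill_value, the nine absent months would come back as NaN columns
dummies = pd.get_dummies(partial, dtype=int).reindex(columns=months, fill_value=0)
print(dummies)
```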

Or convert the month column to an ordered categorical:

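# astype('category', categories=..., ordered=...) no longer works in current pandas;
# pass a CategoricalDtype instead
month_type = pd.api.types.CategoricalDtype(categories=months, ordered=True)
df1 = pd.get_dummies(df['month'].astype(month_type))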
print (df1.head()) 
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 
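
A self-contained sketch of the ordered-categorical route, written with `CategoricalDtype`, which is the spelling current pandas accepts. The `partial` series is a hypothetical example of mine. Because the categories are declared up front, get_dummies emits a column for every month, in the declared order, even for months that never appear.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
month_type = CategoricalDtype(categories=months, ordered=True)

# Hypothetical data containing only two distinct months
partial = pd.Series(['December', 'February', 'December'], name='month')

# Unobserved categories still get an (all-zero) indicator column
dummies = pd.get_dummies(partial.astype(month_type), dtype=int)
print(dummies)
```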

3

IIUC, you can use assign, the ** unpacking operator, and pd.get_dummies:

df.assign(**pd.get_dummies(df['month'])) 

Output:

yyyy  month tmax tmin April August December February January \ 
0 1908 January 5.0 -1.4  0  0   0   0  1 
1 1908 February 7.3 1.9  0  0   0   1  0 
2 1908  March 6.2 0.3  0  0   0   0  0 
3 1908  April 7.4 2.1  1  0   0   0  0 
4 1908  May 16.5 7.7  0  0   0   0  0 
5 1908  June 17.7 8.7  0  0   0   0  0 
6 1908  July 20.1 11.0  0  0   0   0  0 
7 1908  August 17.5 9.7  0  1   0   0  0 
8 1908 September 16.3 8.4  0  0   0   0  0 
9 1908 October 14.6 8.0  0  0   0   0  0 
10 1908 November 9.6 3.4  0  0   0   0  0 
11 1908 December 5.8 -0.3  0  0   1   0  0 
12 1909 January 5.0 0.1  0  0   0   0  1 
13 1909 February 5.5 -0.3  0  0   0   1  0 
14 1909  March 5.6 -0.3  0  0   0   0  0 
15 1909  April 12.2 3.3  1  0   0   0  0 
16 1909  May 14.7 4.8  0  0   0   0  0 
17 1909  June 15.0 7.5  0  0   0   0  0 
18 1909  July 17.3 10.8  0  0   0   0  0 
19 1909  August 18.8 10.7  0  1   0   0  0 
20 1909 September 14.5 8.1  0  0   0   0  0 
21 1909 October 12.9 6.9  0  0   0   0  0 
22 1909 November 7.5 1.7  0  0   0   0  0 
23 1909 December 5.3 0.4  0  0   1   0  0 
24 1910 January 5.2 -0.5  0  0   0   0  1 

    July June March May November October September 
0  0  0  0 0   0  0   0 
1  0  0  0 0   0  0   0 
2  0  0  1 0   0  0   0 
3  0  0  0 0   0  0   0 
4  0  0  0 1   0  0   0 
5  0  1  0 0   0  0   0 
6  1  0  0 0   0  0   0 
7  0  0  0 0   0  0   0 
8  0  0  0 0   0  0   1 
9  0  0  0 0   0  1   0 
10  0  0  0 0   1  0   0 
11  0  0  0 0   0  0   0 
12  0  0  0 0   0  0   0 
13  0  0  0 0   0  0   0 
14  0  0  1 0   0  0   0 
15  0  0  0 0   0  0   0 
16  0  0  0 1   0  0   0 
17  0  1  0 0   0  0   0 
18  1  0  0 0   0  0   0 
19  0  0  0 0   0  0   0 
20  0  0  0 0   0  0   1 
21  0  0  0 0   0  1   0 
22  0  0  0 0   1  0   0 
23  0  0  0 0   0  0   0 
24  0  0  0 0   0  0   0
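
A caveat on the `assign(**...)` approach: Python requires keyword names to be strings, so it works for month names but fails if the category values are not strings. The sketch below uses a hypothetical `month_no` column of integers (my assumption, not part of the question) to show the failure, and falls back to `pd.concat`, which has no such restriction.

```python
import pandas as pd

# Hypothetical frame whose category column holds month numbers, not names
frame = pd.DataFrame({'yyyy': [1908, 1908, 1908], 'month_no': [1, 2, 1]})
dummies = pd.get_dummies(frame['month_no'], dtype=int)

try:
    frame.assign(**dummies)  # the keys here are the integers 1 and 2
    assign_failed = False
except TypeError:
    assign_failed = True

# pd.concat places the indicator columns side by side without needing string keys
combined = pd.concat([frame, dummies], axis=1)
print(assign_failed)
print(combined)
```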