2017-09-26 64 views

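Usage of get_dummies in pandas

I am reading an introduction to machine learning with Python. The author gives the following example: suppose that for the workclass feature we have the possible values "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated".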

print("Original features:\n", list(data.columns), "\n") 

data_dummies = pd.get_dummies(data) 

print("Features after get_dummies:\n", list(data_dummies.columns)) 

Original features: 
['age', 'workclass'] 

Features after get_dummies: 
['age', 'workclass_ ?', 'workclass_ Government Employee', 'workclass_Private Employee', 'workclass_Self Employed', 'workclass_Self Employed Incorporated'] 

My question is: what is the new column workclass_ ? in this output?

Answer


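It is created from the string values of the workclass column. Every distinct value, including '?', becomes its own column named workclass_<value>: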

import pandas as pd

data = pd.DataFrame({'age':[1,1,1,2,1,1],
        'workclass':['Government Employee','Private Employee','Self Employed','Self Employed Incorporated','Self Employed Incorporated','?']})

print(data)
   age                   workclass
0    1         Government Employee
1    1            Private Employee
2    1               Self Employed
3    2  Self Employed Incorporated
4    1  Self Employed Incorporated
5    1                           ?

# Each distinct string in workclass becomes one 0/1 column
data_dummies = pd.get_dummies(data)
print(data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1
1    1            0                              0
2    1            0                              0
3    2            0                              0
4    1            0                              0
5    1            1                              0

   workclass_Private Employee  workclass_Self Employed  \
0                           0                        0
1                           1                        0
2                           0                        1
3                           0                        0
4                           0                        0
5                           0                        0

   workclass_Self Employed Incorporated
0                                     0
1                                     0
2                                     0
3                                     1
4                                     1
5                                     0
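
To see where the extra column comes from, it can help to list the distinct strings in workclass. The sketch below also assumes, beyond what the question states, that '?' is a placeholder for an unknown value. Under that assumption it is masked to NaN, which get_dummies skips by default (dummy_na=False). The names cleaned and dummies are illustrative, and dtype=int is only there so that recent pandas versions print 0/1.

```python
import pandas as pd

data = pd.DataFrame({
    'age': [1, 1, 1, 2, 1, 1],
    'workclass': ['Government Employee', 'Private Employee', 'Self Employed',
                  'Self Employed Incorporated', 'Self Employed Incorporated', '?'],
})

# Every distinct string here, '?' included, would get its own dummy column.
print(data['workclass'].unique())

# Assumption: '?' means "unknown". mask() replaces matching entries with NaN.
cleaned = data.assign(workclass=data['workclass'].mask(data['workclass'] == '?'))

# NaN is ignored by default, so no workclass_? column is produced.
dummies = pd.get_dummies(cleaned, dtype=int)
print(list(dummies.columns))
```

The row that held '?' then has 0 in every workclass_ column. If the unknown category should be kept as a feature, leaving '?' in place (as in the answer above) is the simpler choice.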

This prefix is really helpful when several columns contain the same values:

data = pd.DataFrame({'age':[1,1,3], 
        'workclass':['Government Employee','Private Employee','?'], 
        'workclass1':['Government Employee','Private Employee','Self Employed']}) 

print(data)
   age            workclass           workclass1
0    1  Government Employee  Government Employee
1    1     Private Employee     Private Employee
2    3                    ?        Self Employed

data_dummies = pd.get_dummies(data)
print(data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1
1    1            0                              0
2    3            1                              0

   workclass_Private Employee  workclass1_Government Employee  \
0                           0                               1
1                           1                               0
2                           0                               0

   workclass1_Private Employee  workclass1_Self Employed
0                            0                         0
1                            1                         0
2                            0                         1
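
The prefix does not have to be the column name. As a small sketch, get_dummies also accepts a dict mapping each encoded column to a prefix of your choice. The short prefixes 'wc' and 'wc1' below are made up for illustration, and dtype=int is only there so that recent pandas versions give 0/1 columns.

```python
import pandas as pd

data = pd.DataFrame({
    'age': [1, 1, 3],
    'workclass': ['Government Employee', 'Private Employee', '?'],
    'workclass1': ['Government Employee', 'Private Employee', 'Self Employed'],
})

# Hypothetical short prefixes, one per string column being encoded.
short = pd.get_dummies(
    data,
    prefix={'workclass': 'wc', 'workclass1': 'wc1'},
    dtype=int,
)
print(list(short.columns))
```

The columns stay distinguishable (wc_Government Employee versus wc1_Government Employee) while the names get shorter.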

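If you don't need the prefix, you can pass parameters that replace it with an empty string. Note that columns with the same value then share the same name: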

data_dummies = pd.get_dummies(data, prefix='', prefix_sep='')
print(data_dummies)
   age  ?  Government Employee  Private Employee  Government Employee  \
0    1  0                    1                 0                    1
1    1  0                    0                 1                    0
2    3  1                    0                 0                    0

   Private Employee  Self Employed
0                 0              0
1                 1              0
2                 0              1

Then you can group by column name and aggregate with max to get a single dummy column per unique value:

print(data_dummies.groupby(level=0, axis=1).max())
   ?  Government Employee  Private Employee  Self Employed  age
0  0                    1                 0              0    1
1  0                    0                 1              0    1
2  1                    0                 0              1    3
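
On recent pandas releases, groupby(..., axis=1) emits a deprecation warning (since pandas 2.1, as far as I know). A possible equivalent, sketched here and not part of the original answer, is to transpose, group by the index labels, and transpose back. The name merged is illustrative, and dtype=int keeps every column as integers so the transpose does not fall back to object dtype.

```python
import pandas as pd

data = pd.DataFrame({
    'age': [1, 1, 3],
    'workclass': ['Government Employee', 'Private Employee', '?'],
    'workclass1': ['Government Employee', 'Private Employee', 'Self Employed'],
})

# Empty prefix and separator: dummy columns are named after the raw values,
# so values shared by workclass and workclass1 produce duplicate column names.
data_dummies = pd.get_dummies(data, prefix='', prefix_sep='', dtype=int)

# Transpose so duplicate column names become duplicate index labels,
# take the max per label, then transpose back to the original orientation.
merged = data_dummies.T.groupby(level=0).max().T
print(merged)
```

The max marks a value as present if it appears in either source column, and the grouped columns come back sorted by name, which is why age ends up last.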

Actually, we don't see workclass_? here, but the author mentions it. What is this column? – venkysmarty


I added it in my last edit, please check it now. – jezrael


Thanks! – venkysmarty