Problems with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn.preprocessing, as follows:

from sklearn import preprocessing 
cat_features = ['color', 'director_name', 'actor_2_name'] 
enc = preprocessing.OneHotEncoder(categorical_features=cat_features) 
enc.fit(dataset.values)

However, I cannot proceed because I get this error:

array = np.array(array, dtype=dtype, order=order, copy=copy) 
ValueError: could not convert string to float: PG

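I am surprised that it complains about the string, since converting it is exactly what the encoder is supposed to do! Am I missing something here?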

2017-04-24 Medo

Apply LabelEncoder to these features before using OneHotEncoder.

If you read the documentation for OneHotEncoder, you will see that the input to fit is an "input array of type int". So you need two steps to one-hot encode your data:

from sklearn import preprocessing

cat_features = ['color', 'director_name', 'actor_2_name']
# Step 1: map each string to an integer label
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
# Step 2: one-hot encode the integer labels
# (sparse=False was renamed to sparse_output=False in scikit-learn 1.2)
ohe = preprocessing.OneHotEncoder(sparse=False)  # Easier to read
print(ohe.fit_transform(new_cat_features))

Output:

[[ 0. 1. 0.] 
[ 0. 0. 1.] 
[ 1. 0. 0.]]
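Note that the snippet above encodes the three column names themselves, which is enough to show the mechanics. A minimal sketch of the same two steps applied to the values of one column might look like this. The `colors` array is made-up sample data, not taken from the question, and the default (sparse) output of OneHotEncoder is converted with `.toarray()` so the keyword for dense output does not matter:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values for a single categorical column (not from the thread).
colors = np.array(["Color", "Black and White", "Color", "Sepia"])

# Step 1: map each distinct string to an integer code.
# LabelEncoder sorts the classes, so the codes follow alphabetical order.
label_enc = LabelEncoder()
codes = label_enc.fit_transform(colors)

# Step 2: one-hot encode the integer codes.
# The encoder expects a 2-D input, hence the reshape to a single column.
onehot_enc = OneHotEncoder()
onehot = onehot_enc.fit_transform(codes.reshape(-1, 1)).toarray()

print(label_enc.classes_)
print(codes)
print(onehot)
```

Each row of `onehot` contains exactly one 1, in the position of that row's integer code.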

2017-04-24 13:16:45 ncfirth

From the documentation:

categorical_features : "all" or array of indices or mask
Specify what features are treated as categorical. 
‘all’ (default): All features are treated as categorical. 
array of indices: Array of categorical feature indices. 
mask: Array of length n_features and with dtype=bool.

The column names of a pandas DataFrame will not work. If your categorical features are in columns 0, 2 and 6, use:

from sklearn import preprocessing 
cat_features = [0, 2, 6] 
enc = preprocessing.OneHotEncoder(categorical_features=cat_features) 
enc.fit(dataset.values)

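It must also be noted that if these categorical features are not label-encoded, you need to apply LabelEncoder to them before using OneHotEncoder.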
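If you would rather not hard-code the positions, one way to derive them from the column names is sketched below. The small `dataset` frame is a hypothetical stand-in for the asker's data, so the resulting indices differ from the 0, 2 and 6 used above:

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame (values are made up).
dataset = pd.DataFrame({
    "color": ["Color", "Color"],
    "duration": [120, 95],
    "director_name": ["A", "B"],
    "budget": [1.0, 2.0],
    "actor_2_name": ["X", "Y"],
})

cat_features = ["color", "director_name", "actor_2_name"]

# Translate column names into positional indices, which is the form
# the categorical_features parameter accepts.
cat_indices = [dataset.columns.get_loc(name) for name in cat_features]
print(cat_indices)  # [0, 2, 4]
```

This only solves the naming problem. The string values in those columns still have to be converted to integers first.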

2017-04-24 13:16:06

Thank you very much. – Medo

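You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class: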

from sklearn.preprocessing import LabelBinarizer

cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
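As a rough illustration of the dense and sparse variants, here is a sketch on the values of a single column. The `ratings` list is invented for the example. Three distinct values are used on purpose, because with only two classes LabelBinarizer produces a single 0/1 column rather than two:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical values of one categorical column (not from the thread).
ratings = ["PG", "R", "PG-13", "PG"]

# Default behaviour: a dense NumPy array, one column per class.
encoder = LabelBinarizer()
dense = encoder.fit_transform(ratings)

# With sparse_output=True the same encoding comes back as a SciPy sparse matrix.
sparse_encoder = LabelBinarizer(sparse_output=True)
sparse = sparse_encoder.fit_transform(ratings)

print(encoder.classes_)
print(dense)
print(type(sparse))
```

LabelBinarizer handles one column at a time, so with several categorical columns it would have to be applied to each of them separately.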

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow

2017-07-21 23:21:54

If the dataset is in a pandas DataFrame, using pandas.get_dummies will be more straightforward.

*Corrected from pandas.get_getdummies to pandas.get_dummies
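A minimal sketch of that approach, assuming a small made-up DataFrame in place of the real dataset (the column values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "duration": [120, 95, 101],
})

# One-hot encode only the listed columns; the other columns pass through unchanged.
encoded = pd.get_dummies(dataset, columns=["color", "director_name"])

print(encoded.columns.tolist())
print(encoded)
```

The original string columns are replaced by indicator columns named `<column>_<value>`, for example `color_Color`, so no separate label-encoding step is needed.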

2017-11-27 09:05:14 HappyCoding

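@Medo,

I ran into the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all of the data to be numeric before it even considers selecting the columns given in the categorical_features parameter.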

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

If any of the data in the provided DataFrame X cannot be successfully converted to a float, this check will fail.
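The failure can be reproduced with NumPy alone, without scikit-learn. The sketch below is only an approximation of what the check does internally: the traceback in the question shows it ending in an np.array(..., dtype=dtype) call. The `rows` array is made-up data:

```python
import numpy as np

# Hypothetical rows mixing a string column with a numeric one,
# similar to what dataset.values returns for a mixed DataFrame.
rows = np.array([["PG", 120], ["R", 95]], dtype=object)

message = None
try:
    # Forcing a float dtype on the whole array fails on the first string.
    np.asarray(rows, dtype=np.float64)
except ValueError as exc:
    message = str(exc)

print(message)
```

The conversion is attempted on the entire array, so a single string anywhere in the data is enough to raise the error, whichever columns are listed as categorical.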

I agree that the documentation for sklearn.preprocessing.OneHotEncoder is very misleading in this respect.
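For readers on a newer release: this goes beyond what the thread covers, but as far as I know scikit-learn 0.20 and later let OneHotEncoder accept string categories directly and provide ColumnTransformer for selecting columns by name. A sketch under that assumption, with a hypothetical DataFrame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the asker's DataFrame.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "duration": [120, 95, 101],
})
cat_features = ["color", "director_name"]

# One-hot encode the named columns and pass the remaining ones through.
# sparse_threshold=0 asks for a dense array regardless of density.
transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), cat_features)],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = transformer.fit_transform(dataset)

print(encoded.shape)  # (3, 5)
```

The encoded columns come first, followed by the passed-through columns, and no separate LabelEncoder step is involved.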

2018-02-15 00:03:35
