Label encoder produces multiple levels

I am using the Python LabelEncoder to transform my data. Here is my sample data:

      Database  Target Market_Description Brand \ 
0   CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand 
1   CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand 
2   CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand 

    Sub_Brand Category     Class_Category 
0  NaN  NaN Hi Cal Adult Milk Powders- C1 
1  NaN  NaN Hi Cal Adult Milk Powders- C1 
2  NaN  NaN Hi Cal Adult Milk Powders- C1

I apply the transformation to all of my columns:

df3 = CountryDF.apply(preprocessing.LabelEncoder().fit_transform)

When I check the unique values of the Target column before applying the transformation, I get:

>>> print(pd.unique(CountryDF.Target.ravel()))
[nan 'Elder' 'Others' 'Lady']

But when I check the same column after the transformation, I get many levels:

>>> print(pd.unique(df3.Target.ravel()))
[ 40749 667723 667725 ..., 43347 43346 43345]

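I am not sure how this works. I expected four unique values, because I thought the transformation takes the unique values, sorts them with numpy, and assigns each one its position in that sorted order. Can anyone help me understand this?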

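Edit: This dataset is a subset of a larger dataset. Does that have anything to do with it?

Edit 2: @Kevin I tried your suggestion, and the result is strange. See this.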


2016-03-15 ds_user

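I don't think the size of the dataset affects your results. The purpose of LabelEncoder is to transform the prediction target (in your case, I assume, the Target column). From the User Guide: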

LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1.
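To make that concrete, here is a minimal sketch on a short made-up list of labels borrowed from the Target column. The variable names are mine, not from the question.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample labels; "Elder" appears twice on purpose.
labels = ["Elder", "Others", "Lady", "Elder"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
print(list(le.inverse_transform(codes)))
```

Three distinct labels give three codes, no matter how many rows there are.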

Here is an example. Note that I changed the Target values in your sample CountryDF, purely for demonstration purposes:

from sklearn.preprocessing import LabelEncoder 
import numpy as np 
import pandas as pd 

CountryDF = pd.DataFrame([['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'], 
           ['CN_Milk powder_Incl_Others','Elder','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'], 
           ['CN_Milk powder_Incl_Others','Others','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'], 
           ['CN_Milk powder_Incl_Others','Lady','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'], 
          ['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B1',np.nan,'Hi Cal Adult Milk Powders- C1'], 
          ['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B2',np.nan,'Hi Cal Adult Milk Powders- C1']], 
          columns=['Database','Target','Market_Description','Brand','Sub_Brand', 'Category','Class_Category'])

First, initialize the LabelEncoder, then fit and transform the data, assigning the transformed data to a new column.

le = LabelEncoder()  # initialize the LabelEncoder once

#Create a new column with transformed values. 
CountryDF['EncodedTarget'] = le.fit_transform(CountryDF['Target'])

Note that the last column, EncodedTarget, is a transformed copy of Target.

CountryDF 

Database Target Market_Description Brand Sub_Brand Category Class_Category EncodedTarget 
0 CN_Milk powder_Incl_Others NaN  Shanghai Hyper total O.Brand  NaN  NaN  Hi Cal Adult Milk Powders- C1 0 
1 CN_Milk powder_Incl_Others Elder Shanghai Hyper total O.Brand  NaN  NaN  Hi Cal Adult Milk Powders- C1 1 
2 CN_Milk powder_Incl_Others Others Shanghai Hyper total O.Brand  NaN  NaN  Hi Cal Adult Milk Powders- C1 3 
3 CN_Milk powder_Incl_Others Lady Shanghai Hyper total O.Brand  NaN  NaN  Hi Cal Adult Milk Powders- C1 2

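I hope this helps clear up LabelEncoder. If it does not fully answer your question, it may at least point you in the right direction for transforming your features (which may be what you actually want to do?): check out OneHotEncoder.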
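For reference, a rough sketch of what OneHotEncoder does with a single string column. This assumes a reasonably recent scikit-learn (0.20 or later, where the encoder accepts strings directly), and the sample values are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One feature column with three distinct values (made-up sample).
X = np.array([["Elder"], ["Others"], ["Lady"], ["Elder"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
# The default output is a sparse matrix; densify it for display.
onehot = enc.fit_transform(X).toarray()

print(enc.categories_[0])  # sorted categories found in the column
print(onehot)              # one indicator column per category
```

Each distinct value becomes its own indicator column, so the output width grows with the number of categories.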

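Edit: I added two extra rows to CountryDF (see above), which have two unique values in the Sub_Brand column, after a run of consecutive NaNs. I have a hard time understanding why you are seeing this behavior; it works for me with pandas 0.17.0 and scikit 0.17.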

df3 = CountryDF.apply(LabelEncoder().fit_transform) 
df3 
Database Target Market_Description Brand Sub_Brand Category Class_Category 
0 0 0 0 0 0 0 0 
1 0 1 0 0 0 1 0 
2 0 3 0 0 0 2 0 
3 0 2 0 0 0 3 0 
4 0 0 0 0 1 4 0 
5 0 0 0 0 2 5 0
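Applying a fresh LabelEncoder to each column, as above, throws the fitted encoders away. If you want to map the integer codes back to the original strings later, one option is to keep one encoder per column. This is only a sketch: the `encoders` dict is my own naming, and the small frame (including "A.Brand") is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame; "A.Brand" is a hypothetical value, not from the question.
df = pd.DataFrame({
    "Brand": ["O.Brand", "A.Brand", "O.Brand"],
    "Sub_Brand": ["S_B1", "S_B2", "S_B1"],
})

encoders = {}                       # column name -> fitted LabelEncoder
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# Recover the original strings for one column from its codes.
restored = encoders["Brand"].inverse_transform(encoded["Brand"])
print(encoded)
print(list(restored))
```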

I cannot reproduce your problem. Do you have a link to your data?

pd.unique(CountryDF.Target.ravel())  
array([nan, 'Elder', 'Others', 'Lady'], dtype=object) 
pd.unique(df3.Target.ravel()) 
array([0, 1, 3, 2])
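A hedged guess, not confirmed anywhere in this thread: NaN never compares equal to itself, so depending on the NumPy and scikit-learn versions each missing value can end up as its own class. The all-NaN Category column in the df3 output above, encoded as 0 through 5, would be consistent with that. Newer releases may instead reject a column that mixes strings and NaN. Either way, one workaround is to replace missing values with a placeholder string before encoding. A minimal sketch, where "MISSING" is a placeholder I picked and the small frame is made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame with the same kind of gaps as the question's data.
df = pd.DataFrame({
    "Target": [np.nan, "Elder", "Others", "Lady", np.nan, np.nan],
    "Category": [np.nan] * 6,
}, dtype=object)

# Replace every NaN with one placeholder string ("MISSING" is an assumption),
# so all missing values share a single class.
filled = df.fillna("MISSING")

# Fit a separate encoder on each column.
encoded = filled.apply(lambda col: LabelEncoder().fit_transform(col))

print(encoded)
print(encoded.nunique())
```

With the placeholder in place, Target has four distinct codes and Category has one.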


2016-03-17 22:28:18 Kevin

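OK, so I understand your point: LabelEncoder is meant only for transforming the prediction target. Can't we use the same function to transform features? I don't know why my transformation created this: [40749 667723 667725 ..., 43347 43346 43345]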

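Also, I am not trying to predict the Target column; I am trying to predict the "Class_Category" column. I did not use OneHotEncoder mainly because each of my features has more than 190 unique values, so OneHotEncoder would greatly increase the number of columns.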

See the edit. I am curious whether the last two rows come out as I expect; without the dataset it is hard to tell. – Kevin
