2017-06-10 71 views
2

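Sklearn label encoding multiple columns pandas dataframe

I am trying to encode many columns containing categorical data ("Yes" and "No") in a pandas DataFrame. The complete DataFrame has more than 400 columns, so I am looking for a way to encode all the required columns without encoding them one by one. I am using scikit-learn's LabelEncoder to encode the categorical data.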

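The first part of the DataFrame does not need to be encoded, but I am looking for a way to directly encode all the required columns containing categorical data, without splitting and concatenating the DataFrame.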

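To demonstrate my problem, I first tried to solve it on a small part of the DataFrame. However, I get stuck at the final step of fitting and transforming the data and get a ValueError: bad input shape (4, 3). The code I ran: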

# Import required modules
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a simple dataframe resembling the large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns (this is the line that raises the ValueError)
labelencoder.fit_transform(data.ix[:, 1:])  # First column does not need to be encoded

Full error report:

labelencoder.fit_transform(data.ix[:, 1:]) 
Traceback (most recent call last): 

    File "<ipython-input-47-b4986a719976>", line 1, in <module> 
    labelencoder.fit_transform(data.ix[:, 1:]) 

    File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform 
    y = column_or_1d(y, warn=True) 

    File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d 
    raise ValueError("bad input shape {0}".format(shape)) 

ValueError: bad input shape (4, 3) 

Does anyone know how to do this?

+1

LabelEncoder only supports a single column. You need to iterate over your columns to encode them.
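A minimal sketch of the column-by-column approach suggested in this comment, using the example DataFrame from the question. The names `columns_to_encode`, `encoders` and `encoded` are illustrative and not from the original post, and skipping only the first column is an assumption taken from the question's example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

# Assumption from the question: every column except the first is categorical.
columns_to_encode = data.columns[1:]

# Keep one fitted encoder per column so each mapping can be inspected later.
encoders = {}
encoded = data.copy()
for column in columns_to_encode:
    encoder = LabelEncoder()
    # LabelEncoder expects 1-D input, so it is given one column at a time.
    encoded[column] = encoder.fit_transform(data[column])
    encoders[column] = encoder

print(encoded)
```

Because each column is passed as a 1-D Series, the "bad input shape" error does not occur, and `encoders["B"].classes_` still holds the labels seen in column B.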

+0

Thanks! I will look into this and write a follow-up post – HelloBlob

Answers

1

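As the code below shows, you can encode multiple columns by applying LabelEncoder to the DataFrame with apply. Note, however, that we cannot obtain the classes information for all columns: the same encoder is refitted on every column, so only the classes of the last column remain.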

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform" to every column (the numeric column A is re-encoded too)
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: we cannot obtain the classes information for all columns.
# Only the classes of the last fitted column (D) are kept.
print(le.classes_)
# ['No' 'Yes']
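One possible way around the lost classes information, sketched here as a variation on the answer rather than as part of it: give every column its own encoder through a `defaultdict`. The names `cat_columns`, `encoders`, `encoded_part` and `decoded_part` are illustrative, and selecting the categorical columns with `select_dtypes(exclude="number")` assumes that every non-numeric column should be encoded.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Assumption: every non-numeric column is categorical and should be encoded.
cat_columns = df.select_dtypes(exclude="number").columns

# defaultdict creates a fresh LabelEncoder the first time a column name is
# looked up, so each column ends up with its own fitted encoder.
encoders = defaultdict(LabelEncoder)

# Encode only the categorical columns; column A is left untouched.
encoded_part = df[cat_columns].apply(
    lambda col: encoders[col.name].fit_transform(col)
)
df_encoded = df.copy()
df_encoded[cat_columns] = encoded_part

# The classes of every encoded column are still available.
for name in cat_columns:
    print(name, encoders[name].classes_)

# The encoding can also be reversed column by column.
decoded_part = encoded_part.apply(
    lambda col: encoders[col.name].inverse_transform(col)
)
```

The trade-off is one encoder object per column, which for roughly 400 columns is still cheap and makes `inverse_transform` possible for each of them.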
+0

Why was this downvoted? It works for me... – mic

0

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer


# df is the pandas dataframe
class preprocessing(BaseEstimator, TransformerMixin):
    """Finds categorical columns in a dataframe and one hot encodes the
    columns. You can replace LabelBinarizer with LabelEncoder if you
    require only label encoding. transform() returns the numerical data
    and the encoded categorical data as a dataframe."""

    def __init__(self, df):
        self.datatypes = df.dtypes.astype(str)
        self.catcolumns = []
        self.cat_encoders = []
        self.encoded_df = []

    def fit(self, df, y=None):
        # Start from empty lists so that calling fit() twice does not duplicate entries.
        self.catcolumns = []
        self.cat_encoders = []
        # Collect the names of the object-dtype (categorical) columns.
        for ix, val in zip(self.datatypes.index.values, self.datatypes.values):
            if val == 'object':
                self.catcolumns.append(ix)
        # Fit one LabelBinarizer per categorical column.
        for name in self.catcolumns:
            encs = LabelBinarizer()
            encs.fit(df[name])
            self.cat_encoders.append((name, encs))
        return self

    def transform(self, df, y=None):
        # Build the encoded frames in a local list so transform() can be called repeatedly.
        encoded = []
        for name, encs in self.cat_encoders:
            df_c = encs.transform(df[name])
            # Reuse df's index so the pieces align when concatenated.
            encoded.append(pd.DataFrame(df_c, index=df.index))
        self.encoded_df = pd.concat(encoded, axis=1, ignore_index=True)
        self.df_num = df.drop(self.catcolumns, axis=1)
        y = pd.concat([self.df_num, self.encoded_df], axis=1, ignore_index=True)
        # use return y.values to use in a scikit-learn pipeline
        return y
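For comparison, a sketch that relies on scikit-learn's built-in `ColumnTransformer` and `OrdinalEncoder` instead of a hand-written transformer. This is a swapped-in technique, not what the answer above does: `OrdinalEncoder` produces integer codes like `LabelEncoder` but accepts several columns at once. The names `cat_columns` and `"categorical"` are illustrative, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Assumption: every non-numeric column is categorical.
cat_columns = list(df.select_dtypes(exclude="number").columns)

transformer = ColumnTransformer(
    transformers=[("categorical", OrdinalEncoder(), cat_columns)],
    remainder="passthrough",  # pass the remaining (numeric) columns through unchanged
)

# The result is a NumPy array: encoded columns first, passthrough columns last.
encoded = transformer.fit_transform(df)
print(encoded)

# The categories learned for each encoded column, in the order of cat_columns.
print(transformer.named_transformers_["categorical"].categories_)
```

If one-hot encoding is wanted, as in the answer above, `OneHotEncoder` from `sklearn.preprocessing` can be used in place of `OrdinalEncoder`; the transformer can also be placed as a step in a scikit-learn `Pipeline`.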
+5

Please avoid "code-only" answers. Instead, explain your changes/approach and how they solve the OP's problem. – GPhilo

+0

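While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/18599806) – Ron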

+0

How do I include an explanation? I don't seem to understand the interface – Tobi