What data do sklearn transformers operate on?

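I wrote a set of custom transformers in sklearn to clean data in a pipeline. Each custom transformer takes two Pandas DataFrames as arguments to fit and transform, and transform also returns two DataFrames (see the example below). This works fine when there is only one transformer in the pipeline: DataFrames in and DataFrames out.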

However, when two transformers are combined in a pipeline, like this:

pipeline = Pipeline ([ 
     ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])), 
     ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()), 
     ]) 

X, y = pipeline.fit_transform (X, y) 

==>TypeError: tuple indices must be integers or slices, not Series

The class RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I swap the order of the transformers like this:

pipeline = Pipeline ([ 
     ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()), 
     ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])), 
     ]) 

==> AttributeError: 'tuple' object has no attribute 'apply'

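then the error occurs in the class RemoveAllMissing instead. In both cases the error message is marked with ==>. I believe I have read up thoroughly on what is supposed to happen, but I could not find anything on this topic. Can someone tell me what I am doing wrong? Below you will find code that isolates the problem.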

import numpy as np 
import pandas as pd 
import random 
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.pipeline import Pipeline 

def create_data (rows, cols, frac_nan, random_state=42): 
    random.seed (random_state) 
    X = pd.DataFrame (np.zeros ((rows, cols)), 
         columns=['col' + str(i) for i in range (cols)], 
         index=None) 
    # Create dataframe of (rows * cols) with random floating points 
    y = pd.DataFrame (np.zeros ((rows,))) 
    for row in range(rows): 
     for col in range(cols): 
      X.iloc [row,col] = random.random() 
     X.iloc [row,1] = np.nan # column 1 consists solely of NaNs 
     y.iloc [row] = random.randint (0, 1) 
    # Assign NaN's to a fraction of X 
    n = int(frac_nan * rows * cols) 
    for i in range (n): 
     row = random.randint (0, rows-1) 
     col = random.randint (0, cols-1) 
     X.iloc [row, col] = np.nan 
    # Same applies to y 
    n = int(frac_nan * rows) 
    for i in range (n): 
     row = random.randint (0, rows-1) 
     y.iloc [row,] = np.nan 

    return X, y  

class RemoveAllMissing (BaseEstimator, TransformerMixin): 
    # remove columns containing NaN only 
    def __init__ (self, requested_cols=[]): 
     self.all_missing_data = requested_cols 

    def fit (self, X, y=None): 
     # find empty columns == columns with all missing data 
     missing_cols = X.apply (lambda x: x.count(), axis=0) 
     for idx in missing_cols.index: 
      if missing_cols [idx] == 0: 
       self.all_missing_data.append (idx) 

     return self 

    def transform (self, X, y=None): 
     print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     for all_missing_predictor in self.all_missing_data: 
      del X [all_missing_predictor] 

     print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     return X, y 

    def fit_transform (self, X, y=None): 
     return self.fit (X, y).transform (X, y) 

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin): 
    # remove each row where target contains one or more NaN's 
    def __init__ (self): 
     self.missing_rows = [] 

    def fit (self, X, y = None): 
     # remove all rows where the target value is missing data 
     print (type (X)) 
     if y is None: 
      print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None') 
      return self 

     self.missing_rows = np.array (y.notnull()) # false = missing data 

     return self 

    def transform (self, X, y=None): 
     print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     if y is None: 
      print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None') 
      return X, y 

     X = X [self.missing_rows].reset_index() 
     del X ['index'] 
     y = y [self.missing_rows].reset_index() 
     del y ['index'] 

     print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     return X, y 

    def fit_transform (self, X, y=None): 
     return self.fit (X, y).transform (X, y) 

pipeline = Pipeline ([ 
     ('RemoveAllMissing', RemoveAllMissing()), 
     ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget()), 
     ]) 

X, y = create_data (25, 10, 0.1) 
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
X, y = pipeline.fit_transform (X, y) 
#X, y = RemoveAllMissing().fit_transform (X, y) 
#X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y)

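Edit: As @Vivek requested, I have replaced the original code with standalone code that reproduces the problem. The code as it stands crashes because a tuple is passed as the argument instead of a DataFrame. The pipeline changes the data type, and I cannot find this in the documentation. When one comments out the pipeline call and uncomments the separate transformer calls, everything works fine, like this: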

#X, y = pipeline.fit_transform (X, y) 
X, y = RemoveAllMissing().fit_transform (X, y) 
X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y)


2017-10-04 Arnold

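What does 'print (type (X))' print at that point (in the 'RemoveMissingRowsBasedOnTarget' class, when it is called first)? It seems that 'X' needs to be a DataFrame for the call to the next class ('RemoveAllMissing'), but by then it has become a tuple... – Eskapp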

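It depends on the call order: when RemoveMissingRowsBasedOnTarget is called first, it prints a DataFrame; when it is called second, it prints a tuple. The error message also complains about a tuple that lacks the method being used. – Arnold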

You should add a complete, minimal reproducible example along with sample data. –

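OK, now I have reproduced the error. It seems to be caused by your classes returning both X and y. Although the pipeline accepts y as input (and passes it along to its internal transformers), it assumes that y stays constant and is never returned by any transform() method. That is not the case in your code. If you can move that part somewhere else, it will work.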

See this line in the source code of Pipeline:

if hasattr(transformer, 'fit_transform'):
    res = transformer.fit_transform(X, y, **fit_params)
else:
    res = transformer.fit(X, y, **fit_params).transform(X)

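You are returning two values (X, y), but they are captured in the single variable res, so res becomes a tuple. That tuple is then passed as X to your next transformer, which fails.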
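To make the mechanism concrete, here is a minimal sketch in plain Python. It is a simplified imitation of the loop, not the actual sklearn source, and the names ReturnsPair, ExpectsFrame and run_steps are hypothetical stand-ins invented for this illustration.

```python
class ReturnsPair:
    """Hypothetical stand-in for a transformer that returns (X, y)."""

    def fit_transform(self, X, y=None):
        return X, y


class ExpectsFrame:
    """Hypothetical stand-in for the next step; records the type it receives."""

    def fit_transform(self, X, y=None):
        self.received_type = type(X).__name__
        return X


def run_steps(steps, X, y):
    # Simplified imitation of the pipeline loop: a single variable holds
    # whatever each step returns, and y is never updated.
    for step in steps:
        res = step.fit_transform(X, y)
        X = res
    return X


first, second = ReturnsPair(), ExpectsFrame()
out = run_steps([first, second], [[1.0, 2.0]], [0])
print(second.received_type)  # tuple
```

The second step receives the whole (X, y) pair as its X argument, which is the situation reported in the question.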

You can handle this by unpacking the tuple into X and y, like this:

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin): 
    ... 
    ... 

    def fit (self, X, y = None): 
     # remove all rows where the target value is missing data 
     print (type (X)) 
     if isinstance(X, tuple): 
      y=X[1] 
      X=X[0] 

     ... 
     ... 

     return self 

    def transform (self, X, y=None): 
     if isinstance(X, tuple): 
      y=X[1] 
      X=X[0] 

     ... 
     ... 

     return X, y 

    def fit_transform(self, X, y=None): 
     return self.fit(X, y).transform(X, y)
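Since the same tuple check is repeated in fit and transform, it could be factored into a small helper. The following is only a sketch; unpack_xy is a hypothetical name and not part of sklearn.

```python
def unpack_xy(X, y=None):
    """Hypothetical helper: split an (X, y) tuple that arrived in place of X."""
    if isinstance(X, tuple) and len(X) == 2:
        return X[0], X[1]
    # X is already a plain data object; leave both arguments untouched.
    return X, y


# A previous step returned (X, y), so the pair arrives as the first argument.
X_in, y_in = unpack_xy(([[1.0, 2.0]], [0]))

# A plain call with separate arguments passes through unchanged.
X_plain, y_plain = unpack_xy([[3.0, 4.0]], [1])
```

Each transformer would then begin fit and transform with X, y = unpack_xy(X, y).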

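Make sure you do this in all subsequent transformers in the pipeline. However, I would recommend that you handle X and y separately. I also found some related questions about transforming the target variable y inside a pipeline that you can look at.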
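One way to handle X and y separately is sketched below, assuming the row filtering that depends on the target can run before the pipeline. DropAllMissingColumns and drop_rows_with_missing_target are hypothetical names written for this illustration, and the tiny DataFrame is made-up sample data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropAllMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical X-only transformer: drops columns that are entirely NaN."""

    def fit(self, X, y=None):
        # Remember the columns that contain no non-null value at all.
        self.all_missing_cols_ = [c for c in X.columns if X[c].count() == 0]
        return self

    def transform(self, X):
        # Return only X; the pipeline does not expect y back from transform().
        return X.drop(columns=self.all_missing_cols_)


def drop_rows_with_missing_target(X, y):
    """Hypothetical helper run before the pipeline: filters X and y together."""
    keep = y.notnull().to_numpy()
    return X[keep].reset_index(drop=True), y[keep].reset_index(drop=True)


X = pd.DataFrame({"col0": [0.1, 0.2, 0.3, 0.4],
                  "col1": [np.nan] * 4,
                  "col2": [1.0, np.nan, 3.0, 4.0]})
y = pd.Series([0.0, np.nan, 1.0, 1.0], name="target")

# Row operations that touch y happen outside the pipeline ...
X, y = drop_rows_with_missing_target(X, y)

# ... and the pipeline itself only ever transforms X.
pipeline = Pipeline([("drop_all_missing", DropAllMissingColumns())])
X = pipeline.fit_transform(X, y)
print(X.shape, y.shape)
```

With this split every step receives and returns a DataFrame, so no step has to check for tuples.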


2017-10-06 11:13:35

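This indeed works, thank you! I now understand what went wrong in my code. The decision to pass along only X rather than both X and y seems a bit questionable to me, especially for row operations, where applying them to both X and y would be a sensible approach. In any case, that is how it is. Thank you for providing the solution. – Arnold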
