2017-10-04 24 views
3

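I wrote a set of custom transformers in sklearn to clean data in a pipeline. Each custom transformer takes two Pandas DataFrames as arguments to fit and transform, and transform also returns two DataFrames (see the example below). This works fine when the pipeline contains only one transformer: DataFrames in, DataFrames out. What data do sklearn transformers operate on?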

However, when two transformers are combined in a pipeline, like this:

pipeline = Pipeline ([ 
     ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])), 
     ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()), 
     ]) 

X, y = pipeline.fit_transform (X, y) 

==>TypeError: tuple indices must be integers or slices, not Series 

RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I switch the positions of the transformers like this:

pipeline = Pipeline ([ 
     ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()), 
     ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])), 
     ]) 

==> AttributeError: 'tuple' object has no attribute 'apply' 

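the error occurs in the RemoveAllMissing class instead. In both cases the error message is marked with ==> next to the line where the error occurred. I believe I have read carefully about what should happen here, but I could not find anything on this topic. Can someone tell me what I am doing wrong? Below you will find code that isolates the problem.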

import numpy as np 
import pandas as pd 
import random 
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.pipeline import Pipeline 

def create_data (rows, cols, frac_nan, random_state=42): 
    random.seed (random_state) 
    X = pd.DataFrame (np.zeros ((rows, cols)), 
         columns=['col' + str(i) for i in range (cols)], 
         index=None) 
    # Create dataframe of (rows * cols) with random floating points 
    y = pd.DataFrame (np.zeros ((rows,))) 
    for row in range(rows): 
     for col in range(cols): 
      X.iloc [row,col] = random.random() 
     X.iloc [row,1] = np.nan # column 1 consists solely of NaNs 
     y.iloc [row] = random.randint (0, 1) 
    # Assign NaN's to a fraction of X 
    n = int(frac_nan * rows * cols) 
    for i in range (n): 
     row = random.randint (0, rows-1) 
     col = random.randint (0, cols-1) 
     X.iloc [row, col] = np.nan 
    # Same applies to y 
    n = int(frac_nan * rows) 
    for i in range (n): 
     row = random.randint (0, rows-1) 
     y.iloc [row,] = np.nan 

    return X, y  

class RemoveAllMissing (BaseEstimator, TransformerMixin): 
    # remove columns containing only NaN 
    def __init__ (self, requested_cols=None): 
     # copy the list so that instances never share (and mutate) a default argument 
     self.all_missing_data = list (requested_cols or []) 

    def fit (self, X, y=None): 
     # find empty columns == columns with all missing data 
     missing_cols = X.apply (lambda x: x.count(), axis=0) 
     for idx in missing_cols.index: 
      if missing_cols [idx] == 0: 
       self.all_missing_data.append (idx) 

     return self 

    def transform (self, X, y=None): 
     print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     for all_missing_predictor in self.all_missing_data: 
      del X [all_missing_predictor] 

     print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     return X, y 

    def fit_transform (self, X, y=None): 
     return self.fit (X, y).transform (X, y) 

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin): 
    # remove each row where target contains one or more NaN's 
    def __init__ (self): 
     self.missing_rows = [] 

    def fit (self, X, y = None): 
     # remove all rows where the target value is missing data 
     print (type (X)) 
     if y is None: 
      print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None') 
      return self 

     self.missing_rows = np.array (y.notnull()) # false = missing data 

     return self 

    def transform (self, X, y=None): 
     print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     if y is None: 
      print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None') 
      return X, y 

     X = X [self.missing_rows].reset_index() 
     del X ['index'] 
     y = y [self.missing_rows].reset_index() 
     del y ['index'] 

     print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
     return X, y 

    def fit_transform (self, X, y=None): 
     return self.fit (X, y).transform (X, y) 

pipeline = Pipeline ([ 
     ('RemoveAllMissing', RemoveAllMissing()), 
     ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget()), 
     ]) 

X, y = create_data (25, 10, 0.1) 
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X)) 
X, y = pipeline.fit_transform (X, y) 
#X, y = RemoveAllMissing().fit_transform (X, y) 
#X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y) 

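Edit: as @Vivek requested, I have replaced the original code with standalone code that isolates the problem and runs on its own. As written, the code crashes because a tuple is passed as the argument instead of a DataFrame. The pipeline changes the data type, and I cannot find this in the documentation. If you comment out the pipeline call and uncomment the individual transformer calls, everything works fine, like this: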

#X, y = pipeline.fit_transform (X, y) 
X, y = RemoveAllMissing().fit_transform (X, y) 
X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y) 

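What does `print(type(X))` print at this point? (In the `RemoveMissingRowsBasedOnTarget` class, when it is the first one called.) It seems `X` needs to be a DataFrame for the call to the next class (`RemoveAllMissing`), but it has become a tuple by then... – Eskapp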


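That depends on the call order: when RemoveMissingRowsBasedOnTarget is called first, it prints DataFrame; when it is called second, it prints tuple. The error message also complains that the tuple does not have the method being used. – Arnold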


You should add a complete, minimal reproducible example along with sample data. –

Answers

2

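OK, now I can reproduce the error. It appears to be caused by your classes returning both X and y. A Pipeline accepts y as input (and passes it along to its inner transformers), but it assumes that y stays constant and is never returned by any transform() method. That is not the case in your code. If you can move that part somewhere else, it will work.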

See this line in the source code of Pipeline:

if hasattr(transformer, 'fit_transform'): 
    res = transformer.fit_transform(X, y, **fit_params) 
else: 
    res = transformer.fit(X, y, **fit_params).transform(X) 

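You are returning two values (X, y), but they are captured in the single variable res, so res becomes a tuple. That tuple is then passed as X to your next transformer, which fails on it.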
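
To make the mechanism concrete, here is a minimal sketch in plain Python. `ReturnsPair` and `run_steps` are hypothetical stand-ins, not scikit-learn code; they only mimic the idea of one result variable per step with y passed through unchanged:

```python
class ReturnsPair:
    """Hypothetical transformer that returns (X, y), like the classes in the question."""

    def __init__(self):
        self.seen_type = None  # type of the X argument this step received

    def fit_transform(self, X, y=None):
        self.seen_type = type(X)
        return X, y  # two values, so the caller receives a single tuple


def run_steps(steps, X, y):
    """Simplified stand-in for the pipeline loop; not the real implementation."""
    for step in steps:
        # One result variable per step; y is handed to every step unchanged.
        X = step.fit_transform(X, y)
    return X


first, second = ReturnsPair(), ReturnsPair()
result = run_steps([first, second], [[0.1], [0.2]], [0, 1])

print(first.seen_type)   # the first step still sees the original list
print(second.seen_type)  # the second step sees the tuple returned by the first
```

The second step never gets a chance to see the filtered y on its own: whatever the first step returned arrives bundled inside its X argument.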

You can handle this by unpacking the tuple into X and y, like this:

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin): 
    ... 
    ... 

    def fit (self, X, y = None): 
     # remove all rows where the target value is missing data 
     print (type (X)) 
     if isinstance(X, tuple): 
      y=X[1] 
      X=X[0] 

     ... 
     ... 

     return self 

    def transform (self, X, y=None): 
     if isinstance(X, tuple): 
      y=X[1] 
      X=X[0] 

     ... 
     ... 

     return X, y 

    def fit_transform(self, X, y=None): 
     # return the result; without `return` the pipeline would receive None 
     return self.fit(X, y).transform(X, y) 
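
Since the same isinstance check is needed in both fit and transform of every transformer, it could be factored into a small helper. This is only a sketch; `split_pair` is a hypothetical name, not part of scikit-learn:

```python
def split_pair(X, y=None):
    """Hypothetical helper: unpack X when a previous step handed over (X, y)."""
    if isinstance(X, tuple):
        # Prefer the pair from the previous step over the y argument,
        # because the pipeline keeps passing the original, unfiltered y.
        X, y = X
    return X, y


# A first step receives X and y separately, so nothing changes.
print(split_pair([1, 2, 3], [0, 1, 0]))

# A later step receives the previous step's tuple as X plus the original y.
print(split_pair(([1, 3], [0, 0]), [0, 1, 0]))
```

Each fit and transform method would then start with `X, y = split_pair(X, y)`.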

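Make sure you do this in every subsequent transformer in the pipeline. That said, I would recommend handling X and y separately. There are also some related questions about transforming the target variable y inside a pipeline that are worth a look.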
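
As a rough sketch of that separation, assuming pandas inputs: the row filtering, which touches both X and y, runs before the pipeline, and the pipeline only contains transformers whose transform returns X alone. `DropAllMissingColumns` and `drop_rows_with_missing_target` are hypothetical names chosen for this illustration, and the tiny DataFrame is made-up sample data:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


# The mixin is listed first, as newer scikit-learn releases recommend.
class DropAllMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical X-only variant of RemoveAllMissing: transform returns X alone."""

    def fit(self, X, y=None):
        # A column whose non-null count is zero contains only NaN.
        self.all_missing_columns_ = [c for c in X.columns if X[c].count() == 0]
        return self

    def transform(self, X):
        return X.drop(columns=self.all_missing_columns_)


def drop_rows_with_missing_target(X, y):
    """Row filtering changes both X and y, so it is done outside the pipeline."""
    keep = y.notnull().to_numpy()
    return X[keep].reset_index(drop=True), y[keep].reset_index(drop=True)


# Made-up sample data: col1 is entirely NaN, and one target value is missing.
X = pd.DataFrame({"col0": [0.1, 0.2, 0.3], "col1": [np.nan] * 3})
y = pd.Series([1.0, np.nan, 0.0])

X, y = drop_rows_with_missing_target(X, y)
pipeline = Pipeline([("drop_all_missing", DropAllMissingColumns())])
X_clean = pipeline.fit_transform(X, y)

print(X_clean.shape, y.shape)
```

With this split, every step inside the pipeline receives and returns a DataFrame, so no tuple unpacking is needed.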


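This does work, thank you! I now understand what went wrong in my code. The decision to return only X, rather than both X and y, seems a bit questionable to me, especially for row operations, where applying them to both X and y would be the natural approach. In any case, that is how it works; thanks for providing the solution. – Arnold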
