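I have written a set of custom transformers in sklearn to clean the data in a pipeline. Each custom transformer takes two Pandas DataFrames as parameters to fit and transform, and transform also returns two DataFrames (see the example below). This works fine when there is only one transformer in the pipeline: DataFrames in, DataFrames out. What data do sklearn transformers operate on?
However, when two transformers are combined in a pipeline, like this: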
pipeline = Pipeline ([
    ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
    ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()),
])
X, y = pipeline.fit_transform (X, y)
==>TypeError: tuple indices must be integers or slices, not Series
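the class RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I switch the positions of the transformers, like this:
pipeline = Pipeline ([
    ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget()),
    ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
])
==> AttributeError: 'tuple' object has no attribute 'apply'
the error occurs in the class RemoveAllMissing instead. In both cases the error message is marked with ==>. I believe I have read the documentation carefully to find out what exactly happens, but I could not find anything on this topic. Can someone tell me what I am doing wrong? Below you will find code that isolates the problem.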
import numpy as np
import pandas as pd
import random
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

def create_data (rows, cols, frac_nan, random_state=42):
    random.seed (random_state)
    X = pd.DataFrame (np.zeros ((rows, cols)),
                      columns=['col' + str(i) for i in range (cols)],
                      index=None)
    # Create dataframe of (rows * cols) with random floating points
    y = pd.DataFrame (np.zeros ((rows,)))
    for row in range(rows):
        for col in range(cols):
            X.iloc [row,col] = random.random()
        X.iloc [row,1] = np.nan # column 1 consists solely of NaN's
        y.iloc [row] = random.randint (0, 1)
    # Assign NaN's to a fraction of X
    n = int(frac_nan * rows * cols)
    for i in range (n):
        row = random.randint (0, rows-1)
        col = random.randint (0, cols-1)
        X.iloc [row, col] = np.nan
    # Same applies to y
    n = int(frac_nan * rows)
    for i in range (n):
        row = random.randint (0, rows-1)
        y.iloc [row,] = np.nan
    return X, y

class RemoveAllMissing (BaseEstimator, TransformerMixin):
    # remove columns containing NaN only
    def __init__ (self, requested_cols=[]):
        self.all_missing_data = requested_cols

    def fit (self, X, y=None):
        # find empty columns == columns with all missing data
        missing_cols = X.apply (lambda x: x.count(), axis=0)
        for idx in missing_cols.index:
            if missing_cols [idx] == 0:
                self.all_missing_data.append (idx)
        return self

    def transform (self, X, y=None):
        print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        for all_missing_predictor in self.all_missing_data:
            del X [all_missing_predictor]
        print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    # remove each row where target contains one or more NaN's
    def __init__ (self):
        self.missing_rows = []

    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return self
        self.missing_rows = np.array (y.notnull()) # false = missing data
        return self

    def transform (self, X, y=None):
        print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return X, y
        X = X [self.missing_rows].reset_index()
        del X ['index']
        y = y [self.missing_rows].reset_index()
        del y ['index']
        print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

pipeline = Pipeline ([
    ('RemoveAllMissing', RemoveAllMissing()),
    ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget()),
])
X, y = create_data (25, 10, 0.1)
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
X, y = pipeline.fit_transform (X, y)
#X, y = RemoveAllMissing().fit_transform (X, y)
#X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y)
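Edit: As @Vivek requested, I have replaced the original code with standalone code that isolates the problem and runs on its own. The code as it stands crashes because a tuple is passed as the argument instead of a DataFrame. The Pipeline changes the data type, and I cannot find this in the documentation. When the call to the pipeline is commented out and the comment markers in front of the separate transformer calls are removed, everything works fine, like this: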
#X, y = pipeline.fit_transform (X, y)
X, y = RemoveAllMissing().fit_transform (X, y)
X, y = RemoveMissingRowsBasedOnTarget().fit_transform (X, y)
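The type change described above can be reproduced in a much smaller sketch. It assumes, as far as I understand the Pipeline implementation, that a Pipeline hands whatever an intermediate step's fit_transform returns to the next step as its X, while y is passed along unchanged. The classes ReturnsTuple and RecordsInputType are hypothetical names introduced only for this illustration; they are not part of sklearn or of the code above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ReturnsTuple(BaseEstimator, TransformerMixin):
    """Hypothetical step that, like the transformers above, returns (X, y)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Returning two objects packs them into a single tuple.
        return X, y

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)


class RecordsInputType(BaseEstimator, TransformerMixin):
    """Hypothetical probe that only remembers the type of the X it was given."""

    def fit(self, X, y=None):
        self.seen_type_ = type(X)
        return self

    def transform(self, X, y=None):
        return X


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
y = pd.DataFrame({"target": [0, 1]})

probe = RecordsInputType()
pipeline = Pipeline([("returns_tuple", ReturnsTuple()), ("probe", probe)])
pipeline.fit_transform(X, y)

# Type of the object the second step received as X.
print(probe.seen_type_)
```

If the assumption holds, the probe reports a tuple rather than a DataFrame, which would match both error messages: the second step is handed the whole (X, y) pair as its X, regardless of which transformer comes second.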
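What does 'print (type (X))' print at this point (in the 'RemoveMissingRowsBasedOnTarget' class, when it is called first)? It seems that 'X' needs to be a DataFrame for the call to the next class ('RemoveAllMissing'), but by then it has become a tuple... – Eskapp
That depends on the calling order: when RemoveMissingRowsBasedOnTarget is called first, it prints a DataFrame; when it is called second, it prints a tuple. The error message also complains that the tuple does not have the method being used. – Arnold
You should add a complete, simple reproducible example along with sample data. –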