2016-09-09 28 views

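I want to use a DataFrameMapper to apply an Imputer + Scaler mapping to all float64 columns of a dataframe. My code works with StandardScaler alone, but when I add Imputer the mapper returns just one row of all zeros. How do I use Imputer in a DataFrameMapper on a dataframe?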

I have looked at the question "Imputer on some Dataframe columns in Python" and the tutorial at https://github.com/paulgb/sklearn-pandas, and I also get a warning:

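site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.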

So I understand that there is a shape mismatch. How should the example dataframe below be reshaped?

import pandas as pd 
import numpy as np 
from sklearn_pandas import DataFrameMapper 
from sklearn.preprocessing import StandardScaler, Imputer 

# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html 
dates = pd.date_range('20130101', periods=6) 
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) 

print "Starting with a random dataframe of 6 rows and 4 columns of floats:" 
print df.shape 
print df 

mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])] 
mapper = DataFrameMapper(mapping) 

result = mapper.fit_transform(df) 

print "I get an unexpected result of all zeroes in just one row." 
print result.shape 
print result 

print "Expected is a dataframe of 2 columns and 6 rows of scaled floats." 
print "something like this:" 

mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])] 
mapper = DataFrameMapper(mapping) 

result_scaler = mapper.fit_transform(df) 
print result_scaler.shape 
print result_scaler 

This outputs:

Starting with a random dataframe of 6 rows and 4 columns of floats. 
(6, 4) 
                   A         B         C         D
2013-01-01 -0.070551  0.039074  0.513491 -0.830585
2013-01-02 -0.313069 -1.028936  2.359338 -0.830518
2013-01-03 -1.264926 -0.830575  0.461515  0.427228
2013-01-04 -0.374400  0.619986  0.318128  0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06  1.436073  0.312183  1.566990 -0.272224
Unexpected result is all zeroes in just one row. 
(1L, 12L) 
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] 
Expected is a dataframe of 2 columns and 6 rows of scaled floats. 
something like this 
(6L, 2L) 
[[ 0.08306789 -0.21892275]
 [-0.21975387  1.61986719]
 [-1.40829622 -0.27069922]
 [-0.29633508 -0.4135387 ]
 [-0.12300572 -1.54725542]
 [ 1.964323    0.83054889]]

And a follow-up question: my original dataframe is a mix of floats, booleans and objects (labels). So when I have a list

floats = list(df.select_dtypes(include=['float64']).columns) 
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats] 

how can I prepare the dataframe for these columns (so they have the shape Imputer needs)?

Answer


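The standard Imputer does not work with DataFrameMapper because the orientation of the data that DataFrameMapper passes in is transposed from what Imputer expects. Each column arrives as a 1-D array, which Imputer treats as a single sample (one row) rather than a single feature (one column). StandardScaler then scales that one row to zeros, and stacking the two columns gives the (1, 12) result shown in the question. Creating a wrapper class around Imputer should solve the problem: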

from sklearn.preprocessing import Imputer 


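class SeriesImputer(Imputer):
    # Reshape the 1-D column into a single-feature 2-D array (n_samples, 1)
    # before delegating to the regular Imputer.
    def fit(self, X, y=None):
        return super(SeriesImputer, self).fit(X.reshape(-1, 1), y=y)

    def transform(self, X):
        return super(SeriesImputer, self).transform(X.reshape(-1, 1))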

Then simply replace every occurrence of Imputer with SeriesImputer in the DataFrameMapper.
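To see why the orientation matters, here is a minimal NumPy-only sketch. It does not use sklearn or sklearn-pandas. The `standardize` helper is a hypothetical stand-in for column-wise standard scaling, and the values are column A from the question. Treating the column as one sample produces a single row of zeros, while `reshape(-1, 1)` gives six scaled rows.

```python
import numpy as np


def standardize(X):
    """Hypothetical helper: column-wise zero-mean, unit-variance scaling.

    X is a 2-D array with one row per sample and one column per feature.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # constant columns: avoid dividing by zero
    return (X - mean) / std


# Column A of the example dataframe, as a 1-D array.
column = np.array([-0.070551, -0.313069, -1.264926,
                   -0.374400, -0.235587, 1.436073])

# Interpreted as ONE sample with six features: shape (1, 6).
as_one_sample = column.reshape(1, -1)
# Interpreted as six samples of ONE feature: shape (6, 1).
as_one_feature = column.reshape(-1, 1)

wrong = standardize(as_one_sample)   # every "feature" has a single value
right = standardize(as_one_feature)  # one feature scaled across six samples

print(wrong.shape)
print(wrong)
print(right.shape)
print(right.ravel())
```

With a single row, each column's mean equals its only value, so every entry becomes zero. Two such (1, 6) blocks stacked side by side would give the (1, 12) shape from the question. The `reshape(-1, 1)` call in SeriesImputer avoids this by presenting the column as one feature.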
