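I want to apply a DataFrameMapper with an Imputer + StandardScaler mapping to all float64 columns of a dataframe. My code works with StandardScaler alone, but as soon as I add the Imputer, the mapper returns just one row of all zeroes. How can I use Imputer inside a DataFrameMapper on a dataframe?
I have seen the question "Imputer on some Dataframe columns in Python" and the tutorial at https://github.com/paulgb/sklearn-pandas, and I also get this warning: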
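site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.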
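To make the warning concrete, here is a minimal NumPy-only sketch of the two shapes it mentions. The array `col` and the helper `standardize` are made up for illustration and are not part of sklearn or of my code below. My assumption is that a 1-D column gets read as a single sample, which would plausibly explain a single row of zeroes after scaling.

```python
import numpy as np

# Hypothetical stand-in for one float column with 6 rows.
col = np.array([-0.07, -0.31, -1.26, -0.37, -0.24, 1.44])

as_feature = col.reshape(-1, 1)  # 6 samples, 1 feature
as_sample = col.reshape(1, -1)   # 1 sample, 6 features


def standardize(X):
    """Column-wise (x - mean) / std; zero-variance columns are left at 0."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero
    return (X - X.mean(axis=0)) / std


scaled_feature = standardize(as_feature)  # 6 rows of scaled values
scaled_sample = standardize(as_sample)    # one row, every entry 0

print(as_feature.shape)
print(as_sample.shape)
print(scaled_sample)
```

With only one sample, every feature equals its own mean, so the scaled row is all zeroes regardless of the input values.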
So I understand that there is a shape mismatch. How should the example dataframe below be reshaped?
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, Imputer
# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print "Starting with a random dataframe of 6 rows and 4 columns of floats:"
print df.shape
print df
mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)
print "I get an unexpected result of all zeroes in just one row."
print result.shape
print result
print "Expected is a dataframe of 2 columns and 6 rows of scaled floats."
print "something like this:"
mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])]
mapper = DataFrameMapper(mapping)
result_scaler = mapper.fit_transform(df)
print result_scaler.shape
print result_scaler
This outputs:
Starting with a random dataframe of 6 rows and 4 columns of floats.
(6, 4)
                   A         B         C         D
2013-01-01 -0.070551  0.039074  0.513491 -0.830585
2013-01-02 -0.313069 -1.028936  2.359338 -0.830518
2013-01-03 -1.264926 -0.830575  0.461515  0.427228
2013-01-04 -0.374400  0.619986  0.318128  0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06  1.436073  0.312183  1.566990 -0.272224
Unexpected result is all zeroes in just one row.
(1L, 12L)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Expected is a dataframe of 2 columns and 6 rows of scaled floats.
something like this
(6L, 2L)
[[ 0.08306789 -0.21892275]
 [-0.21975387  1.61986719]
 [-1.40829622 -0.27069922]
 [-0.29633508 -0.4135387 ]
 [-0.12300572 -1.54725542]
 [ 1.964323    0.83054889]]
And a follow-up question: my original dataframe is a mix of floats, booleans and objects (labels). So when I have a list
floats = list(df.select_dtypes(include=['float64']).columns)
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats]
how can I prepare the dataframe for these columns (in the shape the Imputer expects)?
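For reference, the shapes involved can be inspected with pandas alone. The sketch below uses a small made-up mixed-type frame (`mixed`, with hypothetical columns `x`, `y`, `flag` and `label`) and compares selecting a column by name with selecting it by a one-element list. It does not touch DataFrameMapper or Imputer, so whether the 2-D form is what the mapper needs is an assumption on my part.

```python
import pandas as pd

# Hypothetical frame mixing floats, booleans and labels.
mixed = pd.DataFrame({
    "x": [1.0, 0.0, 3.0],
    "y": [0.5, 1.5, 2.5],
    "flag": [True, False, True],
    "label": ["a", "b", "a"],
})

# Same selection as in the question: names of the float64 columns only.
floats = list(mixed.select_dtypes(include=["float64"]).columns)

shapes = {}
for f in floats:
    one_d = mixed[f].values    # selecting by name gives a 1-D array
    two_d = mixed[[f]].values  # selecting by list gives a 2-D array with one column
    shapes[f] = (one_d.shape, two_d.shape)

print(floats)
print(shapes)
```

The boolean and label columns are left out by `select_dtypes`, so only `x` and `y` appear in `floats`.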