Dictionary vectorizer MemoryError and feature hashing float error

Printing X_train[numeric_predictors + categorical_predictors].head() gives:

 bathrooms bedrooms price      building_id \ 
10   1.5  3.0 3000.0 53a5b119ba8f7b61d4e010512e0dfc85 
10000   1.0  2.0 5465.0 c5c8a357cba207596b04d1afd1e4f130 
100004  1.0  1.0 2850.0 c3ba40552e2120b0acfc3cb5730bb2aa 
100007  1.0  1.0 3275.0 28d9ad350afeaab8027513a3e52ac8d5 
100013  1.0  4.0 3350.0         0 

99993   1.0  0.0 3350.0 ad67f6181a49bde19218929b401b31b7 
99994   1.0  2.0 2200.0 5173052db6efc0caaa4d817112a70f32 


           manager_id 
10  5ba989232d0489da1b5f2c45f6688adc 
10000 7533621a882f71e25173b27e3139d83d 
100004 d9039c43983f6e564b1482b273bd7b01 
100007 1067e078446a7897d2da493d2f741316 
100013 98e13ad4b495b9613cef886d79a6291f 
... 
99993 9fd3af5b2d23951e028059e8940a55d7 
99994 d7f57128272bfd82e33a61999b5f4c42

The last two columns are the categorical predictors.

Similarly, printing the pandas Series X_train[target] gives:

10  medium 
10000  low 
100004  high 
100007  low 
100013  low 
... 
99993  low 
99994  low

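I am trying to use this data with a pipeline template, and I get an error with the hashing vectorizers.

First, here is my dictionary vectorizer, which gives me a MemoryError: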

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False) 
feature_dict = X_train[categorical_predictors].to_dict(orient='records') 
dv.fit(feature_dict) 
out = pd.DataFrame(
    dv.transform(feature_dict), 
    columns = dv.feature_names_ 
)
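A note on the MemoryError: the dataset size is not shown, so this is an assumption, but with sparse=False DictVectorizer builds a dense array with one column per distinct building_id and manager_id value, which grows quickly when there are thousands of IDs. The minimal sketch below uses made-up IDs (the `toy` frame is hypothetical) to show what the default sparse output stores instead.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data standing in for the real ID columns.
toy = pd.DataFrame({
    "building_id": ["b1", "b2", "b3", "b1"],
    "manager_id": ["m1", "m2", "m3", "m4"],
})
records = toy.to_dict(orient="records")

# sparse=True (the default) returns a SciPy sparse matrix that stores
# only the non-zero entries instead of a full rows x columns array.
dv_sparse = DictVectorizer(sparse=True)
sparse_out = dv_sparse.fit_transform(records)

n_rows, n_cols = sparse_out.shape
print("shape:", (n_rows, n_cols))           # one column per distinct ID value
print("stored non-zeros:", sparse_out.nnz)  # one entry per row per ID column
print("cells a dense array would hold:", n_rows * n_cols)
```

The gap between the stored non-zeros and the dense cell count widens as the number of distinct IDs grows, which is why keeping the output sparse may avoid the error.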

So in the next cell, I use the following code as my feature hashing encoder:

from sklearn.feature_extraction import FeatureHasher 

fh = FeatureHasher(n_features=2) 
feature_dict = X_train[categorical_predictors].to_dict(orient='records') 
fh.fit(feature_dict) 
out = pd.DataFrame(fh.transform(feature_dict).toarray()) 
#print out.head()

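The commented-out print line gives me a DataFrame whose rows each hold two cells containing -1.0, 0.0, or 1.0 floats.

Here is my vectorizer, which puts the dictionary vectorizer and the feature hasher together: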
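For context on why the hashed output looks like this, here is a small sketch on made-up IDs (the `toy` frame is hypothetical). FeatureHasher turns each string value into a "column=value" token with weight 1 and hashes it into one of the n_features columns with a sign of +1 or -1, so with only two output columns many different IDs necessarily share a column.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy data standing in for the real ID columns.
toy = pd.DataFrame({
    "building_id": ["b1", "b2", "b3", "b1"],
    "manager_id": ["m1", "m2", "m3", "m4"],
})
records = toy.to_dict(orient="records")

# FeatureHasher is stateless, so no fit step is needed before transform.
fh = FeatureHasher(n_features=2, input_type="dict")
hashed = fh.transform(records).toarray()

# Every row carries two tokens (one per ID column), each contributing
# +1 or -1 to one of the two output columns. Tokens that land in the
# same column add up or cancel out.
print(hashed)
```

A larger n_features reduces collisions at the cost of a wider (but still sparse) matrix.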

from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.feature_extraction import FeatureHasher, DictVectorizer 

class MyVectorizer(BaseEstimator, TransformerMixin): 
    """ 
    Vectorize a set of categorical variables 
    """ 

    def __init__(self, cols, hashing=None): 
     """ 
     args: 
      cols: a list of column names of the categorical variables 
      hashing: 
       If None, then vectorization is a simple one-hot-encoding. 
       If an integer, then hashing is the number of features in the output. 
     """ 
     self.cols = cols 
     self.hashing = hashing 

    def fit(self, X, y=None): 

     data = X[self.cols] 

     # Choose a vectorizer 
     if self.hashing is None: 
      self.myvec = DictVectorizer(sparse=False) 
     else: 
      self.myvec = FeatureHasher(n_features = self.hashing) 

     self.myvec.fit(X[self.cols].to_dict(orient='records')) 
     return self 

    def transform(self, X): 

     # Vectorize Input 
     if self.hashing is None: 
      return pd.DataFrame(
       self.myvec.transform(X[self.cols].to_dict(orient='records')), 
       columns = self.myvec.feature_names_ 
      ) 
     else: 
      return pd.DataFrame(
       self.myvec.transform(X[self.cols].to_dict(orient='records')).toarray() 
      )

I put it all together in my pipeline:

from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.pipeline import FeatureUnion 

pipeline = Pipeline([ 
    ('preprocess', FeatureUnion([ 
     ('numeric', Pipeline([ 
      ('scale', StandardScaler()) 
     ]) 
     ), 
     ('categorical', Pipeline([ 
      ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None)) 
     ]) 
     ) 
    ])), 
    ('predict', MultinomialNB(alphas)) 
])

and the alpha parameters:

alphas = { 
    'predict__alpha': [.01, .1, 1, 2, 10] 
}

and I use GridSearchCV, where I get an error on the third line below:

print X_train.head(), train_data[target] 
grid_search = GridSearchCV(pipeline, param_grid=alphas,scoring='accuracy') 
grid_search.fit(X_train[numeric_predictors + categorical_predictors], X_train[target]) 
grid_search.best_params_

ValueError: could not convert string to float: d7f57128272bfd82e33a61999b5f4c42

Source

2017-08-10 Frederic Bastiat

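Can you add some sample data that triggers this error? Please also edit the code so that it is complete and in order, so we can easily copy, paste, and debug it.

Hi, I have done as you suggested. Please take a look and let me know, thanks!

Please help, I am still getting this error.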

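The error is caused by StandardScaler. You are sending all of the data to it, which is wrong. In the FeatureUnion part of your pipeline, you select the categorical columns for MyVectorizer but make no selection for StandardScaler, so every column goes into it, including the string columns that cause the error. Also, since each inner pipeline consists of a single step, the inner pipelines are not needed.

As a first step, change the pipeline to: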
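The failure can be reproduced in isolation. This is a minimal sketch on a hypothetical two-row frame (the shortened ID strings are made up), not your actual data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame mixing a numeric column and a string column.
toy = pd.DataFrame({
    "price": [3000.0, 5465.0],
    "manager_id": ["5ba989232d0489da", "7533621a882f71e2"],
})

# Scaling only the numeric column works.
StandardScaler().fit(toy[["price"]])

# Scaling every column fails, because the ID strings cannot be
# converted to floats.
try:
    StandardScaler().fit(toy)
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```

The second call raises the same kind of ValueError as in the question, which is why the scaler needs to see only the numeric columns.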

pipeline = Pipeline([ 
    ('preprocess', FeatureUnion([ 
     ('scale', StandardScaler()), 
     ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None)) 
    ])), 
    ('predict', MultinomialNB()) 
])

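This still throws the same error, but the pipeline now looks much less complex.

What we need now is something that selects the columns to feed to StandardScaler (the numeric columns), so that the error is not thrown.

We can do this in many ways, but I will follow your coding style and create a new class, MyScaler, with the changes.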

class MyScaler(BaseEstimator, TransformerMixin): 

    def __init__(self, cols): 
     self.cols = cols 

    def fit(self, X, y=None): 

     self.scaler = StandardScaler() 
     self.scaler.fit(X[self.cols]) 
     return self 

    def transform(self, X): 
     return self.scaler.transform(X[self.cols])

and then change the pipeline to:

numeric_predictors=['bathrooms','bedrooms','price'] 
categorical_predictors = ['building_id','manager_id'] 

pipeline = Pipeline([ 
    ('preprocess', FeatureUnion([ 
     ('scale', MyScaler(cols=numeric_predictors)), 
     ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None)) 
    ])), 
    ('predict', MultinomialNB()) 
])

It will still throw an error, because you pass the literal string 'categorical_predictors' to MyVectorizer instead of the list of column names. Change it the way I did for MyScaler, from

MyVectorizer(cols=['categorical_predictors'], hashing=None)

to:

MyVectorizer(cols=categorical_predictors, hashing=None)

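Now your code is syntactically ready to run. But you have used MultinomialNB() as your predictor, which requires non-negative feature values. Since StandardScaler scales the data to zero mean, it turns some values negative, and your code fails again. You need to decide what to do about this; perhaps change it to MinMaxScaler.

Source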
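As a rough illustration of the MinMaxScaler option, here is a sketch on made-up rows shaped like your data. Note the substitutions: for brevity it uses scikit-learn's ColumnTransformer and OneHotEncoder (available from version 0.20) in place of the MyScaler and MyVectorizer classes above, and the `X_toy` and `y_toy` values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_predictors = ["bathrooms", "bedrooms", "price"]
categorical_predictors = ["building_id", "manager_id"]

# Hypothetical toy rows shaped like the question's data.
X_toy = pd.DataFrame({
    "bathrooms": [1.5, 1.0, 1.0, 1.0],
    "bedrooms": [3.0, 2.0, 1.0, 4.0],
    "price": [3000.0, 5465.0, 2850.0, 3350.0],
    "building_id": ["b1", "b2", "b3", "b1"],
    "manager_id": ["m1", "m2", "m1", "m3"],
})
y_toy = pd.Series(["medium", "low", "high", "low"])

preprocess = ColumnTransformer([
    # MinMaxScaler maps each numeric training column into [0, 1].
    ("scale", MinMaxScaler(), numeric_predictors),
    # One-hot columns are 0 or 1; IDs unseen during fit become all zeros.
    ("vectorize", OneHotEncoder(handle_unknown="ignore"), categorical_predictors),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("predict", MultinomialNB()),
])

pipeline.fit(X_toy, y_toy)
predictions = pipeline.predict(X_toy)
print(predictions)
```

One caveat: values outside the range seen during fit can still scale below 0 at prediction time, so clipping (for example MinMaxScaler(clip=True) in recent scikit-learn versions) may be worth considering.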

2017-08-16 10:32:06

Hi, I have cleaned things up a bit and am still running into a similar problem: https://stackoverflow.com/questions/45723699/valueerror-in-pipeline-featurehasher-not-working
