Dictionary hashing MemoryError and feature hashing float error
Here is my data (as a pandas DataFrame):
print(X_train[numeric_predictors + categorical_predictors].head()):
bathrooms bedrooms price building_id \
10 1.5 3.0 3000.0 53a5b119ba8f7b61d4e010512e0dfc85
10000 1.0 2.0 5465.0 c5c8a357cba207596b04d1afd1e4f130
100004 1.0 1.0 2850.0 c3ba40552e2120b0acfc3cb5730bb2aa
100007 1.0 1.0 3275.0 28d9ad350afeaab8027513a3e52ac8d5
100013 1.0 4.0 3350.0 0
99993 1.0 0.0 3350.0 ad67f6181a49bde19218929b401b31b7
99994 1.0 2.0 2200.0 5173052db6efc0caaa4d817112a70f32
manager_id
10 5ba989232d0489da1b5f2c45f6688adc
10000 7533621a882f71e25173b27e3139d83d
100004 d9039c43983f6e564b1482b273bd7b01
100007 1067e078446a7897d2da493d2f741316
100013 98e13ad4b495b9613cef886d79a6291f
...
99993 9fd3af5b2d23951e028059e8940a55d7
99994 d7f57128272bfd82e33a61999b5f4c42
The last two columns are the categorical predictors.
Similarly, printing the pandas Series X_train[target]:
10 medium
10000 low
100004 high
100007 low
100013 low
...
99993 low
99994 low
I am trying to use a pipeline template, and I get an error when using the hashing vectorizers.
First, here is my dictionary vectorizer, which gives me a MemoryError:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
dv.fit(feature_dict)
out = pd.DataFrame(
    dv.transform(feature_dict),
    columns=dv.feature_names_
)
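A note on the MemoryError: with sparse=False, DictVectorizer builds a dense array with one column per distinct building_id and manager_id value, which is probably what exhausts memory here. Below is a minimal sketch that keeps the output sparse (the default). The toy records are hypothetical stand-ins for X_train[categorical_predictors], and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records standing in for X_train[categorical_predictors]
records = [
    {"building_id": "b1", "manager_id": "m1"},
    {"building_id": "b2", "manager_id": "m1"},
    {"building_id": "b3", "manager_id": "m2"},
]

dv = DictVectorizer()  # sparse=True is the default
encoded = dv.fit_transform(records)

print(encoded.shape)  # one column per distinct "column=value" pair
print(encoded.nnz)    # only the non-zero entries are stored
```

The result is a scipy sparse matrix. Calling .toarray() on it would bring the dense allocation back, so it is usually passed to the estimator as it is.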
So in the next cell, I use the following code as my feature hashing encoder:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=2)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
fh.fit(feature_dict)
out = pd.DataFrame(fh.transform(feature_dict).toarray())
# print(out.head())
The commented-out print line gives me a DataFrame in which every row has 2 cells, each holding one of the floats -1.0, 0.0, or 1.0.
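For context on that output: FeatureHasher maps each "column=value" string to one of n_features buckets and adds a signed 1 there, so with n_features=2 almost every ID collides with others. A small sketch of the difference a wider hash space makes, again with hypothetical toy records and an arbitrary choice of 2**10 buckets:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy records standing in for X_train[categorical_predictors]
records = [
    {"building_id": "b1", "manager_id": "m1"},
    {"building_id": "b2", "manager_id": "m1"},
    {"building_id": "b3", "manager_id": "m2"},
]

# FeatureHasher is stateless, so transform can be called without fit
tiny = FeatureHasher(n_features=2, input_type="dict").transform(records)
wide = FeatureHasher(n_features=2**10, input_type="dict").transform(records)

print(tiny.toarray())  # all values squeezed into only 2 buckets
print(wide.shape)      # (3, 1024), kept as a scipy sparse matrix
```

Leaving the wide result sparse avoids the memory cost of a dense array while giving each ID a much better chance of its own bucket.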
Here is my vectorizer, which combines the dictionary vectorizer and the feature hasher:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import FeatureHasher, DictVectorizer

class MyVectorizer(BaseEstimator, TransformerMixin):
    """
    Vectorize a set of categorical variables
    """

    def __init__(self, cols, hashing=None):
        """
        args:
            cols: a list of column names of the categorical variables
            hashing:
                If None, then vectorization is a simple one-hot-encoding.
                If an integer, then hashing is the number of features in the output.
        """
        self.cols = cols
        self.hashing = hashing

    def fit(self, X, y=None):
        data = X[self.cols]
        # Choose a vectorizer
        if self.hashing is None:
            self.myvec = DictVectorizer(sparse=False)
        else:
            self.myvec = FeatureHasher(n_features=self.hashing)
        self.myvec.fit(data.to_dict(orient='records'))
        return self

    def transform(self, X):
        # Vectorize Input
        records = X[self.cols].to_dict(orient='records')
        if self.hashing is None:
            return pd.DataFrame(
                self.myvec.transform(records),
                columns=self.myvec.feature_names_
            )
        else:
            return pd.DataFrame(
                self.myvec.transform(records).toarray()
            )
I put it all together in my pipeline:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('numeric', Pipeline([
            ('scale', StandardScaler())
        ])),
        ('categorical', Pipeline([
            # Pass the list of column names, not the literal string
            ('vectorize', MyVectorizer(cols=categorical_predictors, hashing=None))
        ]))
    ])),
    # alpha is set through the predict__alpha grid during the search
    ('predict', MultinomialNB())
])
And the alpha parameters:
alphas = {
    'predict__alpha': [.01, .1, 1, 2, 10]
}
And using GridSearchCV, I get an error on the third line here:
print(X_train.head(), train_data[target])
grid_search = GridSearchCV(pipeline, param_grid=alphas,scoring='accuracy')
grid_search.fit(X_train[numeric_predictors + categorical_predictors], X_train[target])
grid_search.best_params_
ValueError: could not convert string to float: d7f57128272bfd82e33a61999b5f4c42
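Can you add some sample data on which this error occurs? Also, please edit the question to include the complete code, in order, so that we can easily copy, paste, and debug it.
Hi, I did as you suggested. Please take a look and let me know, thanks!
Please help, I am still getting this error.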
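One likely reading of the traceback: FeatureUnion hands the full input frame to every branch, so StandardScaler also receives the building_id and manager_id strings, and the value in the error message is one of the manager_id hashes. A minimal sketch of routing columns per branch with ColumnTransformer follows, assuming scikit-learn 0.20 or later and a hypothetical toy frame. Note two substitutions that are not in the question: MinMaxScaler replaces StandardScaler because MultinomialNB rejects negative inputs, and OneHotEncoder replaces the custom vectorizer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame mirroring the columns in the question
X = pd.DataFrame({
    "bathrooms": [1.5, 1.0, 1.0, 1.0],
    "bedrooms": [3.0, 2.0, 1.0, 1.0],
    "price": [3000.0, 5465.0, 2850.0, 3275.0],
    "building_id": ["b1", "b2", "b3", "b1"],
    "manager_id": ["m1", "m2", "m1", "m2"],
})
y = pd.Series(["medium", "low", "high", "low"])

numeric_predictors = ["bathrooms", "bedrooms", "price"]
categorical_predictors = ["building_id", "manager_id"]

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        # Each branch only receives the columns listed for it
        ("numeric", MinMaxScaler(), numeric_predictors),
        ("categorical", OneHotEncoder(handle_unknown="ignore"),
         categorical_predictors),
    ])),
    ("predict", MultinomialNB()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

A pipeline built this way can be passed to GridSearchCV with the same predict__alpha grid, since the final step keeps the name predict.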