Error with FeatureUnion in an sklearn pipeline
I have the following dataframe:
ID Text
1 qwerty
2 asdfgh
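I want to create an md5 hash of the Text field and drop the ID field from the dataframe above. To do this, I created a simple pipeline with custom transformers from sklearn.
Here is the code I used: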
import hashlib

import sklearn.base
from sklearn import pipeline


class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def hash_generate(self, txt):
        # Collapse runs of whitespace, then return the md5 hex digest
        m = hashlib.md5()
        text = str(txt)
        long_text = ' '.join(text.split())
        m.update(long_text.encode('utf-8'))
        text_hash = m.hexdigest()
        return text_hash

    def transform(self, x):
        return x[self.key].apply(lambda z: self.hash_generate(z)).values


class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = x.drop(['Gene', 'Variation', 'ID', 'Text'], axis=1)
        return x.values


fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion([
        ('hash', cust_txt_col('Text')),  # can pass in either a pipeline
        ('normalized', cust_regression_vals())  # or a transformer
    ]))
])
When I run this, I get the following error:
ValueError: all the input arrays must have same number of dimensions
Could you please tell me what is wrong with my code?
If I run the classes one by one:
For cust_txt_col I get the output
['3e909f222a1e06098ec7ca1ea7e84540' '1691bdba3b75df145169e0501369fce3'
'1691bdba3b75df145169e0501369fce3' ..., 'e11ec9863aaeb93f77a231319021e14d'
'851c517b2af0a46cb9bc9373b748b6ff' '0ffe46fc75d21a5347b1f1a5a84526ad']
For cust_regression_vals I get the output
[[qwerty],
[asdfgh]]
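The two outputs above differ in dimensionality: the first is a flat 1-D array, the second a 2-D array with one column. As far as I know, FeatureUnion joins dense transformer outputs side by side with np.hstack, so the mismatch can be reproduced with NumPy alone. This is only a sketch; the arrays `hashes` and `values` are hypothetical stand-ins for the two outputs, not the real data.

```python
import numpy as np

# Hypothetical stand-ins for the two transformer outputs shown above.
hashes = np.array(['3e909f222a1e06098ec7ca1ea7e84540',
                   '1691bdba3b75df145169e0501369fce3'])  # 1-D, shape (2,)
values = np.array([['qwerty'], ['asdfgh']])              # 2-D, shape (2, 1)

print(hashes.shape, values.shape)

try:
    np.hstack([hashes, values])
    stacked_ok = True
except ValueError as exc:
    # NumPy refuses to concatenate a 1-D array with a 2-D array.
    stacked_ok = False
    print('hstack failed:', exc)
```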
Shouldn't it be 'cust_txt_col(dataframe['Text'])'? Also, if you run the classes one by one, what output do you get?
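@E.Z. Updated my post with the class output – Backtrack
The problem may be the shape of 'cust_regression_vals'; try adding 'return x.ravel().values' at the end of the second class and check whether that fixes it. If not, can you post the output of 'cust_txt_col.shape'?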
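Following up on the shape discussion in the comments, here is one possible way to make both transformers return 2-D arrays so that FeatureUnion can stack them. Treat it as a sketch under assumptions: the class names `HashTextColumn` and `DropColumns`, the helper `md5_of_text`, and the extra `Score` column are all hypothetical and not part of the original post. It uses `reshape(-1, 1)` on the hash column, which is a different approach from the `ravel()` idea above.

```python
import hashlib

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


def md5_of_text(txt):
    """Collapse whitespace and return the md5 hex digest (hypothetical helper)."""
    normalized = ' '.join(str(txt).split())
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()


# Mixin listed first, which is the order scikit-learn recommends.
class HashTextColumn(TransformerMixin, BaseEstimator):
    """Hypothetical variant of cust_txt_col that returns a 2-D column."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        hashed = x[self.key].apply(md5_of_text).to_numpy()
        # reshape(-1, 1) turns shape (n,) into (n, 1)
        return hashed.reshape(-1, 1)


class DropColumns(TransformerMixin, BaseEstimator):
    """Hypothetical variant of cust_regression_vals with configurable columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x.drop(self.columns, axis=1).to_numpy()


# 'Score' is an assumed extra column so that something remains after the drop.
df = pd.DataFrame({'ID': [1, 2],
                   'Text': ['qwerty', 'asdfgh'],
                   'Score': [0.5, 1.5]})

union = FeatureUnion([
    ('hash', HashTextColumn('Text')),
    ('rest', DropColumns(['ID', 'Text'])),
])
result = union.fit_transform(df)
print(result.shape)  # (2, 2)
```

Both transformers now produce arrays with one row per input row, so the hash column and the remaining columns line up in the combined output.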