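scikit pipeline dimension mismatch error with FeatureUnion

This is my first post. I have been trying to combine features with FeatureUnion and Pipeline, but when I add a tf-idf + SVD pipeline, the test step fails with a "dimension mismatch" error. My task is simple: build a regression model to predict search relevance. The code and the error report are below. Is there something wrong with my code?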
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# On scikit-learn < 0.18, train_test_split lives in sklearn.cross_validation.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.svm import SVR

# read_tsv_data and tokenize are helper functions defined elsewhere (not shown)
df = read_tsv_data(input_file)
df = tokenize(df)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=2016)
x_train = df_train['sq'].values
y_train = df_train['relevance'].values
x_test = df_test['sq'].values
y_test = df_test['relevance'].values

# char ngrams
char_ngrams = CountVectorizer(ngram_range=(2, 5), analyzer='char_wb', encoding='utf-8')
# TFIDF word ngrams
tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer='word', encoding='utf-8')
# SVD
svd = TruncatedSVD(n_components=100, random_state=2016)
# SVR
svr_lin = SVR(kernel='linear', C=0.01)

pipeline = Pipeline([
    ('feature_union',
        FeatureUnion(
            transformer_list=[
                ('char_ngrams', char_ngrams),
                ('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
                ('tfidf_word_ngrams', tfidf_word_ngrams),
                ('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
            ]
        )
    ),
    ('svr_lin', svr_lin)
])

model = pipeline.fit(x_train, y_train)
y_pred = model.predict(x_test)
When I add the pipeline below to the FeatureUnion list:
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
the following exception is raised:
2016-07-31 10:34:08,712 : Testing ... Test Shape: (400,) - Training Shape: (1600,)
Traceback (most recent call last):
  File "src/model/end_to_end_pipeline.py", line 236, in <module>
    main()
  File "src/model/end_to_end_pipeline.py", line 233, in main
    process_data(input_file, output_file)
  File "src/model/end_to_end_pipeline.py", line 175, in process_data
    y_pred = model.predict(x_test)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 203, in predict
    Xt = transform.transform(Xt)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 523, in transform
    for name, trans in self.transformer_list)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
    self.results = batch()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 399, in _transform_one
    return transformer.transform(X)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 291, in transform
    Xt = transform.transform(Xt)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/decomposition/truncated_svd.py", line 201, in transform
    return safe_sparse_dot(X, self.components_.T)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 179, in safe_sparse_dot
    ret = a * b
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 389, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
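Thanks for the suggestion. That was exactly the problem. I just created an additional SVD transformer to handle the tf-idf word n-grams, and it works as expected. – sylar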
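
For later readers, here is a minimal sketch of what the comment above describes. It assumes the cause is that both make_pipeline(...) calls in the question reuse the single svd object: as far as I can tell, FeatureUnion and Pipeline do not copy their steps by default, so the branch that is fitted last overwrites the SVD components the other branch needs at transform time. The toy documents, the build_union helper, the share_svd flag, and n_components=2 are illustrative assumptions and not part of the original code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

# Toy documents, purely illustrative (not the asker's data).
train_docs = [
    "red wooden chair", "blue metal chair", "wooden table",
    "metal table lamp", "red lamp shade", "blue shade",
]
test_docs = ["red metal lamp", "wooden shade"]


def build_union(share_svd):
    """Hypothetical helper: two SVD branches, optionally reusing one SVD object."""
    char_ngrams = CountVectorizer(ngram_range=(2, 5), analyzer="char_wb")
    tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer="word")
    # n_components=2 is an assumption chosen to suit the tiny toy corpus.
    svd_char = TruncatedSVD(n_components=2, random_state=2016)
    if share_svd:
        svd_word = svd_char  # same object in both branches, as in the question
    else:
        svd_word = TruncatedSVD(n_components=2, random_state=2016)
    return FeatureUnion([
        ("char_ngrams_svd", make_pipeline(char_ngrams, svd_char)),
        ("tfidf_word_ngrams_svd", make_pipeline(tfidf_word_ngrams, svd_word)),
    ])


# Shared instance: the word branch refits the same SVD object, so the char
# branch no longer matches the fitted components when transform is called.
shared = build_union(share_svd=True)
shared.fit(train_docs)
try:
    shared.transform(test_docs)
    shared_failed = False
except ValueError:
    shared_failed = True

# Separate instances: each branch keeps components for its own vocabulary.
fixed = build_union(share_svd=False)
train_features = fixed.fit_transform(train_docs)
test_features = fixed.transform(test_docs)
print(shared_failed, train_features.shape, test_features.shape)
```

If you would rather keep a single SVD configuration, sklearn.base.clone(svd) should give each branch its own unfitted copy with the same parameters.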