Bug in CalibratedClassifierCV when using a pipeline with TF-IDF?

First of all, thanks in advance. I really don't know whether I should open an issue, so I wanted to check whether anyone has run into this before.

I have the following problem when using CalibratedClassifierCV for text classification. My estimator is a pipeline created this way (simplified example):

# Import libraries first 
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.pipeline import make_pipeline 
from sklearn.calibration import CalibratedClassifierCV 
from sklearn.linear_model import LogisticRegression 

# Now create the estimators: pipeline -> calibratedclassifier(pipeline) 
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression()) 
calibrated_pipeline = CalibratedClassifierCV(pipeline, cv=2)

Now we can create a simple training set to check that the classifier works:

# Create text and labels arrays 
text_array = np.array(['Why', 'is', 'this', 'happening']) 
outputs = np.array([0,1,0,1])

When I try to fit the calibrated_pipeline object, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 4]

If you want, I can copy the whole exception traceback, but this should be easy to reproduce. Thanks in advance!

EDIT: I made a mistake when creating the arrays. It is fixed now (thanks @ogrisel!). Also, calling:

pipeline.fit(text_array, outputs)

works fine, but doing the same with the calibrated classifier fails!

Source

2017-02-02 Iñigo Cortajarena Sauca

When reporting a bug, you should always include the full traceback. Very often, the answer to your problem is right there. – ogrisel

np.array(['Why', 'is', 'this', 'happening']).reshape(-1,1) is a 2D array of strings, whereas the docstring of the fit_transform method of the TfidfVectorizer class states that it expects:

Parameters 
    ---------- 
    raw_documents : iterable 
     an iterable which yields either str, unicode or file objects

If you iterate over your 2D numpy array, you get a sequence of 1D arrays of strings rather than the strings themselves:

>>> list(text_array) 
[array(['Why'], 
     dtype='<U9'), array(['is'], 
     dtype='<U9'), array(['this'], 
     dtype='<U9'), array(['happening'], 
     dtype='<U9')]

So the fix is simple, just pass:

text_documents = ['Why', 'is', 'this', 'happening']

as the raw input to the vectorizer.
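To make the difference concrete, here is a minimal sketch using only numpy. The variable names (`column`, `flat`, and so on) are illustrative and not part of the original post; it only shows what iteration yields for each shape.

```python
import numpy as np

words = ['Why', 'is', 'this', 'happening']

# 2D column array (what reshape(-1, 1) produces):
# iterating yields one 1D array per row, not a string.
column = np.array(words).reshape(-1, 1)
first_from_column = next(iter(column))

# Flattening back to 1D (or using the plain list) yields the strings themselves,
# which is what an "iterable which yields str" is expected to do.
flat = column.ravel()
first_from_flat = next(iter(flat))

print(column.shape, type(first_from_column).__name__)
print(flat.shape, isinstance(first_from_flat, str))
```

If the data already lives in a 2D array, `ravel()` or `tolist()` on a single column should give the vectorizer the kind of iterable its docstring describes.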

EDIT: As a side note, LogisticRegression is almost always well calibrated by default, so CalibratedClassifierCV will probably not bring anything in this case.

Source

2017-02-02 14:57:26 ogrisel

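Thanks a lot @ogrisel! Indeed, logistic regression is usually well calibrated, but this is just an example; in my real application I need to use other estimators as well as more preprocessing steps inside the pipeline (including custom functions). Leaving that aside, you are right, I made a mistake when reshaping the vector. However, running this: '# Create text and labels arrays' 'text_array = np.array(['Why', 'is', 'this', 'happening'])' 'outputs = np.array([0,1,0,1])' and calling 'fit' on 'pipeline' alone works, but doing the same with the calibrated pipeline fails. –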

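@ogrisel Calling fit with a list instead of an array also gives me an error; it still works for 'pipeline' but fails for 'calibrated_pipeline'. The error says: 'ValueError: Found input variables with inconsistent numbers of samples: [1, 4]'. Could it be that the input shape expected by the calibration object conflicts with the iterable that TF-IDF expects? Thanks for your effort! Iñigo. –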

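Hmm, I think this can be considered a bug in CalibratedClassifierCV: it should be less strict in its input validation (basically it should not check the input itself and should delegate that to the underlying estimator). Feel free to open an issue on GitHub and send a pull request. – ogrisel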
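Until the input validation is relaxed, one possible workaround (not proposed in the thread, so treat it as an assumption) is to vectorize the text first and calibrate only the classifier on the resulting numeric matrix. The sketch below reuses the toy data from the question. Note that the TF-IDF vocabulary and IDF weights are then fitted on all samples rather than inside each calibration fold, which may or may not be acceptable in a real application.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

text_documents = ['Why', 'is', 'this', 'happening']
outputs = np.array([0, 1, 0, 1])

# Assumption: vectorize up front so the calibrator only ever sees
# a numeric (sparse) matrix instead of raw strings.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_documents)

# Calibrate the bare classifier rather than the whole pipeline.
calibrated = CalibratedClassifierCV(LogisticRegression(), cv=2)
calibrated.fit(X, outputs)

# New text must go through the same fitted vectorizer before prediction.
probabilities = calibrated.predict_proba(vectorizer.transform(text_documents))
print(probabilities.shape)
```

This sidesteps the sample-count check because the matrix has one row per document, but it gives up the convenience of a single estimator object that takes raw text.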
