2014-09-24 · 23 views · 3 votes

GridSearch for multilabel classification in scikit-learn

I am trying to run GridSearch for the best hyperparameters within each fold of a ten-fold cross-validation. It worked fine in my previous multiclass classification work, but not this time with a multilabel task.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())

C_range = 10.0 ** np.arange(-2, 9)
# Parameter name for the wrapped LinearSVC: estimator__C
# (estimator__clf__C would only be valid if estimator were a Pipeline
# with a step named 'clf')
param_grid = dict(estimator__C=C_range)

clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)

I get the error:

--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-65-dcf9c1d2e19d> in <module>() 
     6 
     7 clf = GridSearchCV(clf, param_grid) 
----> 8 clf.fit(X_train, y_train) 

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y) 
    595 
    596   """ 
--> 597   return self._fit(X, y, ParameterGrid(self.param_grid)) 
    598 
    599 

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, 
parameter_iterable) 
    357         % (len(y), n_samples)) 
    358    y = np.asarray(y) 
--> 359   cv = check_cv(cv, X, y, classifier=is_classifier(estimator)) 
    360 
    361   if self.verbose > 0: 

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X, 
y, classifier, warn_mask) 
    1365    needs_indices = None 
    1366   if classifier: 
--> 1367    cv = StratifiedKFold(y, cv, indices=needs_indices) 
    1368   else: 
    1369    if not is_sparse: 

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, 
y, n_folds, indices, shuffle, random_state) 
    427   for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)): 
    428    for label, (_, test_split) in zip(unique_labels, per_label_splits): 
--> 429     label_test_folds = test_folds[y == label] 
    430     # the test split can be too big because we used 
    431     # KFold(max(c, self.n_folds), self.n_folds) instead of 

ValueError: boolean index array should have 1 dimension 

This probably refers to the dimensionality or the format of the label indicator.

print X_train.shape, y_train.shape 

which prints:

(147, 1024) (147, 6) 

It seems that GridSearchCV applies StratifiedKFold internally for classifiers. The problem appears when the stratified K-fold strategy is given a multilabel target:

StratifiedKFold(y_train, 10) 

ValueError        Traceback (most recent call last) 
<ipython-input-87-884ffeeef781> in <module>() 
----> 1 StratifiedKFold(y_train, 10) 

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, 
y, n_folds, indices, shuffle, random_state) 
    427   for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)): 
    428    for label, (_, test_split) in zip(unique_labels, per_label_splits): 
--> 429     label_test_folds = test_folds[y == label] 
    430     # the test split can be too big because we used 
    431     # KFold(max(c, self.n_folds), self.n_folds) instead of 

ValueError: boolean index array should have 1 dimension 

For now, the conventional K-fold strategy works fine. Is there any way to implement stratified K-fold for multilabel classification?
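The plain K-fold workaround mentioned above can be passed explicitly through GridSearchCV's `cv` parameter. A minimal sketch using the modern `sklearn.model_selection` API (the question uses the older `grid_search` module), with toy data standing in for the question's 147 × 1024 / 147 × 6 arrays:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy multilabel data (hypothetical shapes, not the question's real data).
rng = np.random.RandomState(0)
X = rng.randn(40, 8)
y = (rng.rand(40, 3) > 0.5).astype(int)

clf = OneVsRestClassifier(LinearSVC())
param_grid = {"estimator__C": [0.1, 1.0, 10.0]}

# Passing a plain (non-stratified) KFold sidesteps the StratifiedKFold error,
# since KFold never inspects the label matrix.
grid = GridSearchCV(clf, param_grid, cv=KFold(n_splits=5))
grid.fit(X, y)
print(grid.best_params_)
```

The trade-off is that plain K-fold ignores label balance, so rare labels may be missing from some training folds.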

Answers

4

Grid search performs stratified cross-validation for classification problems, but this is not implemented for multilabel tasks; in fact, multilabel stratification is an unsolved problem in machine learning. I recently faced the same problem, and all the literature I could find was a method proposed in this article (whose authors state they could not find any other attempt at solving the problem).

+0

Thank you for the comment. I noticed the problem and updated the thread. Thanks also for sharing the paper; I will go through it. – Francis 2014-09-26 06:06:53

+0

I just came up with an idea: wouldn't it make sense to do a stratified split on each class's samples? Since `GridSearchCV` is fitting a `OneVsRestClassifier`, why can't it handle each class's samples separately to produce `L` binary problems, so that a stratified split can be made for each of the `L`? – Francis 2015-11-18 14:29:56
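The per-label idea in this comment can be sketched with plain scikit-learn on toy data. It also shows the catch: stratifying each binary problem separately produces different folds per label, so no single `cv` split handed to `GridSearchCV` can stratify all `L` problems at once:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy multilabel target with L = 3 binary problems.
rng = np.random.RandomState(0)
y = (rng.rand(30, 3) > 0.5).astype(int)

skf = StratifiedKFold(n_splits=3)
for label in range(y.shape[1]):
    # Stratify on one binary column at a time; each label yields its own folds.
    splits = list(skf.split(np.zeros((len(y), 1)), y[:, label]))
    print("label", label, "first test fold:", splits[0][1])
```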

0

As Fred Foo pointed out, stratified cross-validation is not implemented for multilabel tasks. One alternative is to use scikit-learn's StratifiedKFold class on the transformed label space, as suggested here.

Below is sample Python code.

from sklearn.model_selection import StratifiedKFold
# `lp` must map the multilabel matrix to a single multiclass vector,
# e.g. scikit-multilearn's label powerset transformer:
from skmultilearn.problem_transform import LabelPowerset
lp = LabelPowerset()

kf = StratifiedKFold(n_splits=n_splits, random_state=None, shuffle=shuffle)

# Stratify on the transformed (label-combination) classes, then index
# the original multilabel arrays with the resulting fold indices.
for train_index, test_index in kf.split(X, lp.transform(y)):
    X_train = X[train_index, :]
    y_train = y[train_index, :]

    X_test = X[test_index, :]
    y_test = y[test_index, :]

    # learn the classifier
    classifier.fit(X_train, y_train)

    # predict labels for test data
    predictions = classifier.predict(X_test)
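For hyperparameter search, such precomputed splits can also be handed straight to GridSearchCV, whose `cv` parameter accepts an iterable of (train, test) index pairs. A self-contained sketch that builds the label-powerset classes with plain NumPy instead of scikit-multilearn (toy data; all names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy data: six distinct label combinations, five samples each,
# so every powerset class has enough members for 3-fold stratification.
combos = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                   [1, 1, 0], [0, 1, 1], [1, 0, 1]])
y = np.repeat(combos, 5, axis=0)          # shape (30, 3)
rng = np.random.RandomState(0)
X = rng.randn(len(y), 5)

# Encode each row's label combination as one multiclass label
# (a NumPy stand-in for lp.transform(y)).
_, y_powerset = np.unique(y, axis=0, return_inverse=True)

skf = StratifiedKFold(n_splits=3)
splits = list(skf.split(X, y_powerset))   # stratified on the combinations

grid = GridSearchCV(OneVsRestClassifier(LinearSVC()),
                    {"estimator__C": [0.1, 1.0]}, cv=splits)
grid.fit(X, y)
print(grid.best_params_)
```

Stratifying on label combinations only works when each combination occurs at least `n_splits` times; for long-tailed label sets, the iterative stratification method from the article cited above is the more principled option.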