2016-10-06 35 views

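GridSearch with SVM produces IndexError

I am building a classifier with an SVM and want to run a grid search to help find the best model automatically. Here is the code: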

from sklearn.svm import SVC 
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV 
from sklearn.multiclass import OneVsRestClassifier 

X.shape  # (22343, 323) 
y.shape  # (22343, 1) 

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
) 

tuned_parameters = [ 
    { 
    'estimator__kernel': ['rbf'], 
    'estimator__gamma': [1e-3, 1e-4], 
    'estimator__C': [1, 10, 100, 1000] 
    }, 
    { 
    'estimator__kernel': ['linear'], 
    'estimator__C': [1, 10, 100, 1000] 
    } 
] 

model_to_set = OneVsRestClassifier(SVC(), n_jobs=-1) 
clf = GridSearchCV(model_to_set, tuned_parameters) 
clf.fit(X_train, y_train) 

Running it, I get the following error message (this is not the whole stack trace, just the last 3 calls):

---------------------------------------------------- 
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups) 
    88   X, y, groups = indexable(X, y, groups) 
    89   indices = np.arange(_num_samples(X)) 
---> 90   for test_index in self._iter_test_masks(X, y, groups): 
    91    train_index = indices[np.logical_not(test_index)] 
    92    test_index = indices[test_index] 

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups) 
    606 
    607  def _iter_test_masks(self, X, y=None, groups=None): 
--> 608   test_folds = self._make_test_folds(X, y) 
    609   for i in range(self.n_splits): 
    610    yield test_folds == i 

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y, groups) 
    593   for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)): 
    594    for cls, (_, test_split) in zip(unique_y, per_cls_splits): 
--> 595     cls_test_folds = test_folds[y == cls] 
    596     # the test split can be too big because we used 
    597     # KFold(...).split(X[:max(c, n_splits)]) when data is not 100% 

IndexError: too many indices for array 

In addition, when I reshape the array so that y is (22343,), GridSearch never finishes, even with tuned_parameters set to default values.

Here are the versions of all the packages, in case that helps:

Python: 3.5.2

scikit-learn: 0.18

pandas: 0.19.0


Have you tried reducing the number of samples and running it? – MMF

Answer


There seems to be nothing wrong with your implementation.

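However, as the sklearn documentation for SVC mentions, "the fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples".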
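To get a feel for that scaling on your own machine, you could time a single SVC fit at a few growing sample sizes. This is only a rough sketch: the data is synthetic (generated with make_classification), the helper name time_svc_fit and the sample sizes are made up for illustration, and the timings will vary between machines.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC


def time_svc_fit(sample_sizes, n_features=20, seed=0):
    """Return a list of (n_samples, seconds) for fitting one RBF SVC per size.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    results = []
    for n in sample_sizes:
        # Synthetic stand-in data (assumption), not the data from the question
        X, y = make_classification(
            n_samples=n, n_features=n_features, random_state=seed
        )
        start = time.perf_counter()
        SVC(kernel="rbf", gamma=1e-3, C=1).fit(X, y)
        results.append((n, time.perf_counter() - start))
    return results


timings = time_svc_fit([500, 1000, 2000])
for n, seconds in timings:
    print("{:>6} samples: {:.3f} s".format(n, seconds))
```

If the fit time grows much faster than the sample count as the sizes double, a grid search over 12 parameter combinations with cross-validation on more than 10000 rows will multiply that cost many times over.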

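In your case you have 22343 samples, which can lead to computational and memory problems. That is why the default CV takes so long. Try reducing your training set to 10000 samples or fewer.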
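A minimal sketch of that suggestion follows, assuming synthetic data in place of the real X and y. It also flattens y from shape (n, 1) to (n,) with np.ravel, since the traceback in the question fails on `test_folds[y == cls]`, which expects a 1-D label array. The subsample helper, the cap of 200 rows, and the shortened parameter grid are all hypothetical choices that keep the example fast; for the real data the cap would be 10000 or fewer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 3 classes,
# with y stored as a column vector like the (22343, 1) array in the question.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)
y = y.reshape(-1, 1)

# Flatten the labels: (n, 1) -> (n,)
y = np.ravel(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)


def subsample(X, y, max_samples, seed=0):
    """Return at most max_samples rows, keeping class proportions.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    if len(y) <= max_samples:
        return X, y
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=max_samples, stratify=y, random_state=seed
    )
    return X_sub, y_sub


# 200 is only for this toy data set
X_small, y_small = subsample(X_train, y_train, max_samples=200)

# Shortened version of the grid from the question, to keep the run short
tuned_parameters = [
    {
        'estimator__kernel': ['rbf'],
        'estimator__gamma': [1e-3, 1e-4],
        'estimator__C': [1, 10],
    },
    {
        'estimator__kernel': ['linear'],
        'estimator__C': [1, 10],
    },
]

clf = GridSearchCV(OneVsRestClassifier(SVC()), tuned_parameters, cv=3)
clf.fit(X_small, y_small)
print(clf.best_params_)
```

Once the search finishes on the reduced set, you can refit the best parameters on more data, or raise the cap step by step to see where the run time becomes impractical.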