2016-10-26 262 views
3

GridSearchCV with StratifiedKFold

I want to run GridSearchCV on a RandomForestClassifier, but the data is imbalanced, so I use StratifiedKFold:

from sklearn.model_selection import StratifiedKFold 
from sklearn.grid_search import GridSearchCV 
from sklearn.ensemble import RandomForestClassifier 

param_grid = {'n_estimators':[10, 30, 100, 300], "max_depth": [3, None], 
      "max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50], "criterion": ["gini", "entropy"]} 

rfc = RandomForestClassifier() 

clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) 

But I get an error:

TypeError         Traceback (most recent call last) 
<ipython-input-597-b08e92c33165> in <module>() 
    9 rfc = RandomForestClassifier() 
    10 
---> 11 clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) 

c:\python34\lib\site-packages\sklearn\grid_search.py in fit(self, X, y) 
    811 
    812   """ 
--> 813   return self._fit(X, y, ParameterGrid(self.param_grid)) 

c:\python34\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable) 
    559          self.fit_params, return_parameters=True, 
    560          error_score=self.error_score) 
--> 561     for parameters in parameter_iterable 
    562     for train, test in cv) 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable) 
    756    # was dispatched. In particular this covers the edge 
    757    # case of Parallel used with an exhausted iterator. 
--> 758    while self.dispatch_one_batch(iterator): 
    759     self._iterating = True 
    760    else: 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator) 
    601 
    602   with self._lock: 
--> 603    tasks = BatchedCalls(itertools.islice(iterator, batch_size)) 
    604    if len(tasks) == 0: 
    605     # No more tasks available in the iterator: tell caller to stop. 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice) 
    125 
    126  def __init__(self, iterator_slice): 
--> 127   self.items = list(iterator_slice) 
    128   self._size = len(self.items) 

c:\python34\lib\site-packages\sklearn\grid_search.py in <genexpr>(.0) 
    560          error_score=self.error_score) 
    561     for parameters in parameter_iterable 
--> 562     for train, test in cv) 
    563 
    564   # Out is a list of triplet: score, estimator, n_test_samples 

TypeError: 'StratifiedKFold' object is not iterable 

When I write `cv=StratifiedKFold(y_train)` I get `ValueError: The number of folds must be of Integral type.` But when I write `cv=5`, it works.

I don't understand what is wrong with StratifiedKFold.

Answers

0

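The API changed in the latest version. You used to pass y to the constructor; now you only pass the number of folds when creating the StratifiedKFold object, and you pass y later.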
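A minimal sketch of the newer calling convention described above, assuming scikit-learn 0.18 or later. The arrays `X` and `y` are made up for illustration: the number of folds goes to the constructor, and the labels are passed later, to `split`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# The constructor takes the number of folds, not the labels.
skf = StratifiedKFold(n_splits=5)

# The labels are supplied when the splits are generated.
folds = list(skf.split(X, y))
for train_idx, test_idx in folds:
    # Print the test-fold size and the number of minority samples in it.
    print(len(test_idx), int(y[test_idx].sum()))
```

Each test fold should keep roughly the same class ratio as the full dataset, which is the point of stratifying imbalanced data.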

+0

I wrote `cv = StratifiedKFold(10)` and got `TypeError: 'StratifiedKFold' object is not iterable`. When should y be passed? – user183897

+0

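In the current version, import sklearn.model_selection.StratifiedKFold. Then you can do cv = StratifiedKFold(10) and there should be no error. But perhaps you are importing from the previous module, which is still there for compatibility purposes until version 0.20. – simon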

+0

May I ask one more question? I downloaded the file scikit_learn-0.18-cp34-cp34m-win32.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn and installed it, but now I get `ImportError: DLL load failed: %1 is not a valid Win32 application.` What is wrong? – user183897

0

It seems that `cv=StratifiedKFold()).fit(X_train, y_train)` should be changed to `cv=StratifiedKFold()).split(X_train, y_train)`.

+0

This has nothing to do with the error. The line `clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)` just defines the object clf and then calls the fit method to train/fit clf. – sera

+0

@rll also mentioned that fit should be replaced with split. – ebrahimi

0

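The problem here is an API change, as mentioned in the other answers, but those answers could be more explicit.

The documentation for the cv parameter states: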

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,

  • integer, to specify the number of folds.

  • An object to be used as a cross-validation generator.

  • An iterable yielding train/test splits.

For integer/None inputs, if y is binary or multiclass, StratifiedKFold used. If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used.

So, whichever cross-validation strategy is used, all that is needed is to supply the generator by calling the split function, as suggested:

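from sklearn.model_selection import GridSearchCV, StratifiedKFold

# estimator, parameters, qwk (a scorer), xtrain and ytrain are defined elsewhere
kfolds = StratifiedKFold(5)
clf = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds.split(xtrain, ytrain))
clf.fit(xtrain, ytrain)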
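
A self-contained sketch of this approach, assuming scikit-learn 0.18 or later. The dataset from `make_classification`, the small parameter grid, and the built-in `"f1"` scorer are stand-ins chosen for illustration, since the `qwk` scorer is not defined in this thread. The splits are wrapped in `list(...)` as a defensive choice, because a generator can only be consumed once.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced dataset: roughly 90% of samples in class 0.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

kfolds = StratifiedKFold(n_splits=5)
# Materialize the (train, test) index pairs so they can be reused.
splits = list(kfolds.split(X, y))

# Small illustrative grid, not the one from the question.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
clf = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                   scoring="f1", cv=splits)
clf.fit(X, y)
print(clf.best_params_)
```
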
2

I had exactly the same problem.

The solution that worked for me was to replace

from sklearn.grid_search import GridSearchCV 

with

from sklearn.model_selection import GridSearchCV 

Then it should work fine.
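
For reference, a sketch of the question's code with both imports taken from `sklearn.model_selection`, assuming scikit-learn 0.18 or later. The data from `make_classification` stands in for the asker's `X_train` and `y_train`, which are not shown, and the grid is cut down so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in for the asker's X_train / y_train (not shown in the question).
X_train, y_train = make_classification(n_samples=200, n_features=10,
                                       weights=[0.9, 0.1], random_state=0)

# A reduced version of the question's grid, to keep the run short.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 5],
              "criterion": ["gini", "entropy"]}

rfc = RandomForestClassifier(random_state=0)
# Both classes come from model_selection, so the splitter object is accepted as-is.
clf = GridSearchCV(rfc, param_grid=param_grid,
                   cv=StratifiedKFold(n_splits=5)).fit(X_train, y_train)
print(clf.best_score_)
```

Here the splitter object is passed directly, without calling `split`, because the newer GridSearchCV calls `split` itself during `fit`.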

0

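In version '0.18.1' of sklearn,

GridSearchCV(estimator, param_grid=param_grid, cv=5)

uses a StratifiedKFold with 5 splits when the estimator is a classifier and y is binary or multiclass.

Documentation: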

> cv : int, cross-validation generator or an iterable, optional 
>   Determines the cross-validation splitting strategy. 
>   Possible inputs for cv are: 
>   - None, to use the default 3-fold cross validation, 
>   - integer, to specify the number of folds in a `(Stratified)KFold`, 
>   - An object to be used as a cross-validation generator. 
>   - An iterable yielding train, test splits.
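
One way to check how an integer `cv` is resolved is the public helper `check_cv` from `sklearn.model_selection`, which, to my understanding, is the same helper the search classes rely on. This is a sketch assuming scikit-learn 0.18 or later, and the label array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical binary target with a 9:1 class imbalance.
y_binary = np.array([0] * 90 + [1] * 10)

# Integer cv + classifier + binary target -> stratified splitter.
cv_clf = check_cv(5, y_binary, classifier=True)
print(type(cv_clf).__name__, cv_clf.get_n_splits())

# Same integer, but not a classifier -> plain KFold.
cv_reg = check_cv(5, y_binary, classifier=False)
print(type(cv_reg).__name__, cv_reg.get_n_splits())
```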