
I like to use scikit-learn's LOGO (LeaveOneGroupOut) as my cross-validation method, combined with learning curves. This works well in most of the cases I deal with, but I can only (effectively) tune the two parameters that I believe (from experience) matter most in those cases: max_features and the number of estimators. My code example below: using RandomizedSearchCV (or GridSearchCV) with LeaveOneGroupOut cross-validation in scikit-learn.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import LeaveOneGroupOut, validation_curve

    Fscorer = make_scorer(f1_score, average='micro')
    gp = training_data["GP"].values
    logo = LeaveOneGroupOut()
    RF_clf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=49)
    RF_clf200 = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=49)
    RF_clf300 = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=49)
    RF_clf400 = RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=49)
    RF_clf500 = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=49)
    RF_clf600 = RandomForestClassifier(n_estimators=600, n_jobs=-1, random_state=49)

    param_name = "max_features"
    param_range = [5, 10, 15, 20, 25, 30]


    plt.figure()
    plt.suptitle('n_estimators = 100', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf100, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label="mean test score")
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()


    plt.figure()
    plt.suptitle('n_estimators = 200', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf200, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label="mean test score")
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()
    ... 
    ... 
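As a side note, the repeated per-`n_estimators` plotting blocks above can be collapsed into a loop. Here is a minimal runnable sketch of that idea; `make_classification` and the synthetic groups are stand-ins for my actual `training_data`, and I compute just the mean validation scores rather than plotting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

# Synthetic stand-in for the real data: 90 samples in 3 groups of 30
X, y = make_classification(n_samples=90, n_features=30, random_state=49)
gp = np.repeat(np.arange(3), 30)

Fscorer = make_scorer(f1_score, average='micro')
logo = LeaveOneGroupOut()
param_range = [5, 10, 15]

for n_est in [100, 200]:
    clf = RandomForestClassifier(n_estimators=n_est, n_jobs=-1, random_state=49)
    # A fresh logo.split() generator is created on each loop iteration,
    # since a generator can only be consumed once
    _, test_scores = validation_curve(clf, X, y,
                                      cv=logo.split(X, y, groups=gp),
                                      param_name="max_features",
                                      param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    print(n_est, np.mean(test_scores, axis=1))
```

One curve of mean F1 scores per `n_estimators` value, without six copies of the same block.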

What I would really like, though, is to combine LOGO with grid search, or randomized search, for a more thorough search of the parameter space.

As of now my code looks like this:

from scipy.stats import randint as sp_randint

param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
              "max_features": sp_randint(5, 30),
              "max_depth": sp_randint(2, 18),
              "criterion": ['entropy', 'gini'],
              "min_samples_leaf": sp_randint(2, 17)}

clf = RandomForestClassifier(random_state=49)

n_iter_search = 45
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=Fscorer, cv=8,
                                   n_jobs=-1)
random_search.fit(X, y)

When I replace `cv=8` with `cv=logo.split(X, y, groups=gp)`, I get this error message:

--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-10-0092e11ffbf4> in <module>() 
---> 35 random_search.fit(X, y) 


/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups) 
    1183           self.n_iter, 
    1184           random_state=self.random_state) 
-> 1185   return self._fit(X, y, groups, sampled_params) 

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable) 
    540 
    541   X, y, groups = indexable(X, y, groups) 
--> 542   n_splits = cv.get_n_splits(X, y, groups) 
    543   if self.verbose > 0 and isinstance(parameter_iterable, Sized): 
    544    n_candidates = len(parameter_iterable) 

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in get_n_splits(self, X, y, groups) 
    1489    Returns the number of splitting iterations in the cross-validator. 
    1490   """ 
-> 1491   return len(self.cv) # Both iterables and old-cv objects support len 
    1492 
    1493  def split(self, X=None, y=None, groups=None): 

TypeError: object of type 'generator' has no len() 

Any suggestions as to (1) what is going on, and, more importantly, (2) how I can make it work (combining RandomizedSearchCV with LeaveOneGroupOut)?

*UPDATE February 8, 2017*

It worked using `cv=logo` together with `random_search.fit(X, y, wells)`.

Answer


As @Vivek Kumar suggested, you should not pass `logo.split()` into RandomizedSearchCV; just pass a cv object like `logo` into it. RandomizedSearchCV internally calls `split()` to generate the train/test indices. You can then pass your `gp` groups into the `fit()` call of the RandomizedSearchCV or GridSearchCV object.
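The traceback makes the root cause visible: `logo.split()` returns a generator, and a generator has no `len()`, which is exactly what the search object's internal `len(self.cv)` call trips over. A tiny demonstration on toy data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 6 samples in 3 groups
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
splits = logo.split(X, y, groups=groups)
print(type(splits).__name__)  # -> generator

try:
    len(splits)  # this is what the search does internally on old sklearn
except TypeError as e:
    print(e)     # object of type 'generator' has no len()
```

Passing the `logo` object itself lets scikit-learn call `split()` (and `get_n_splits()`) at the right time, with the groups it was given.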

So instead of doing this:

random_search.fit(X, y) 

do this:

random_search.fit(X, y, gp) 

EDIT: You can also pass `gp` in a dict to the `fit_params` parameter of the GridSearchCV or RandomizedSearchCV constructor.
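Putting it together, here is a minimal runnable sketch of the corrected search; `make_classification` and the synthetic groups are placeholders for the question's `training_data`, and the parameter space is trimmed down to keep it fast:

```python
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

# Synthetic stand-in for the real data: 120 samples in 6 groups of 20
X, y = make_classification(n_samples=120, n_features=30, random_state=49)
gp = np.repeat(np.arange(6), 20)

Fscorer = make_scorer(f1_score, average='micro')
param_dist = {"n_estimators": [100, 200],
              "max_features": sp_randint(5, 30)}

logo = LeaveOneGroupOut()
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=49),
                                   param_distributions=param_dist,
                                   n_iter=4, scoring=Fscorer,
                                   cv=logo,  # the splitter object, NOT logo.split()
                                   n_jobs=-1, random_state=49)
random_search.fit(X, y, groups=gp)  # groups go to fit()
print(random_search.best_params_)
```

The search now calls `logo.split(X, y, groups=gp)` internally for each candidate, so every fold leaves exactly one group out.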


I'm not sure I understand. Where do I pass `cv.get_n_splits`? – MyCarta


@MyCarta Sorry, I meant `logo.split()`, not `cv.get_n_splits`. I have edited my answer to remove the confusion. –


@Vivek Kumar OK, that is a bit clearer. Are you also saying there is no workaround? – MyCarta