Scikit-Learn GridSearch custom scoring function

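I need to perform kernel PCA on a dataset of dimension (5000, 26421) to obtain a lower-dimensional representation. To choose the number of components (say k), I reduce the data, reconstruct it back into the original space, and compute the mean squared error between the reconstructed and original data for different values of k.

I came across sklearn's grid search functionality and want to use it for this parameter estimation. Since kernel PCA has no score function, I implemented a custom scoring function and passed it to GridSearchCV.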

from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV 
import numpy as np 
import math 

def scorer(clf, X): 
    Y1 = clf.inverse_transform(X) 
    error = math.sqrt(np.mean((X - Y1)**2)) 
    return error 

param_grid = [ 
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]}, 
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]}, 
] 

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30) 
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer) 
clf.fit(X)

However, it results in the following error:

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2., 2., 1., ..., 0., 0., 0.], 
    ...., 0., 1., ..., 0., 0., 0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719, 0.00919842, ....  0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
    117        "for %d indexed." % 
    118        (X.shape[0], X.shape[1], Y.shape[0])) 
    119  elif X.shape[1] != Y.shape[1]: 
    120   raise ValueError("Incompatible dimension for X and Y matrices: " 
    121       "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122        X.shape[1], Y.shape[1])) 
     X.shape = (1667, 26421) 
     Y.shape = (112, 100) 
    123 
    124  return X, Y 
    125 
    126 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100

Can anyone point out what exactly I am doing wrong?


2017-09-13 user1683894

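First, PCA has a [score()](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.score) function. Second, use [make_scorer()](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) to pass a custom score function to GridSearch. –

I am not using PCA in this case but kernel PCA, which has no score function. I also tried the make_scorer function, but that did not work. – user1683894

The syntax of your scoring function is incorrect. You only need to pass the predicted and true values of the classifier. So this is how you declare your custom scoring function: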

def my_scorer(y_true, y_predicted): 
    error = math.sqrt(np.mean((y_true - y_predicted)**2)) 
    return error

Then you can use the make_scorer function in sklearn to pass it to GridSearch. Be sure to set the greater_is_better attribute accordingly:

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

I assume you are calculating an error, so this attribute should be set to False, since the lower the error, the better:

from sklearn.metrics import make_scorer 
my_func = make_scorer(my_scorer, greater_is_better=False)

Then you pass it to GridSearch:

GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)

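Here my_clf is your classifier. That said, I don't think GridSearchCV is exactly what you are looking for. It basically takes the data in the form of train and test splits, whereas here you only want to transform your input data. You need to use Pipeline in sklearn. See the example of combining PCA and GridSearchCV mentioned here.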
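To make the sign flip concrete, here is a minimal sketch of what greater_is_better=False does. The toy data and the DummyRegressor are made up purely for illustration and are not part of the question. Note that a scorer built by make_scorer is called as scorer(estimator, X, y_true), so it expects target values to compare predictions against.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def rmse(y_true, y_pred):
    """Root-mean-square error: lower is better."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# greater_is_better=False makes the scorer return -rmse, so that
# GridSearchCV (which always maximizes the score) prefers the smallest error.
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Toy data, invented for this illustration only.
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

raw_error = rmse(y_demo, model.predict(X_demo))        # plain error, positive
scorer_value = neg_rmse_scorer(model, X_demo, y_demo)  # same error, sign-flipped
```

The value stored in scorer_value is the negative of raw_error, which is why the best_score_ reported by a grid search using such a scorer is negative.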


2017-09-14 04:20:52

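Can anyone explain the downvote? –

I need to tune the hyperparameters of kernel PCA to find the parameter setting with the minimum reconstruction error, and found that GridSearch offers this functionality. In the case above, y_predicted = kpca.fit_transform(input_data) and y_true = kpca.inverse_transform(y_predicted), hence the clf parameter in the error function. Even with your approach, I get the error "TypeError: __call__() takes at least 4 arguments (3 given)" – user1683894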
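The reconstruction described in this comment can be sketched as a plain callable scorer, which GridSearchCV accepts through its scoring argument. This is only a sketch under a few assumptions: the scorer first projects the held-out data with transform before calling inverse_transform (calling inverse_transform directly on the original X is what triggers the dimension mismatch in the traceback), it returns the negative error because GridSearchCV maximizes the score, and y defaults to None because no targets are passed to fit. The function name, the small random array standing in for the real (5000, 26421) data, and the grid values are all made up for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV


def neg_reconstruction_rmse(estimator, X, y=None):
    """Negative RMSE between X and its kernel PCA reconstruction."""
    X_reduced = estimator.transform(X)                    # project to k dimensions
    X_restored = estimator.inverse_transform(X_reduced)   # map back to input space
    return -float(np.sqrt(np.mean((X - X_restored) ** 2)))


# Small random stand-in for the real data (assumption, for illustration only).
rng = np.random.RandomState(0)
X_small = rng.rand(60, 8)

# Illustrative grid; the values are scaled down to fit the toy data.
param_grid = [
    {"kernel": ["poly"], "degree": [1, 2], "n_components": [2, 4]},
    {"kernel": ["rbf"], "gamma": [0.1, 0.01], "n_components": [2, 4]},
]

kpca = KernelPCA(fit_inverse_transform=True)
search = GridSearchCV(kpca, param_grid, scoring=neg_reconstruction_rmse, cv=3)
search.fit(X_small)  # unsupervised: no y is passed

best_params = search.best_params_
best_score = search.best_score_  # negative RMSE, closer to 0 is better
```

Because the scorer receives the fitted estimator itself, it does not go through make_scorer, which expects y_true and predictions and is the likely source of the TypeError above when no targets are available.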
