2015-08-13

I am using Ridge linear regression from scikit-learn. The documentation states that the alpha parameter must be small, yet the best alpha for my ridge regression is high.

But I get the best model performance at alpha = 6060. What am I doing wrong?

Here is the description from the documentation:

alpha : {float, array-like} shape = [n_targets] Small positive values 
of alpha improve the conditioning of the problem and reduce the 
variance of the estimates. 

Here is my code:

import pandas as pd 
import numpy as np 
import custom_metrics as cmetric 
from sklearn import preprocessing 
from sklearn import cross_validation 
from sklearn import linear_model 

# Read data files: 
df_train = pd.read_csv(path + "/input/train.csv") 
df_test = pd.read_csv(path + "/input/test.csv") 

#print df.shape 
#(50999, 34) 

#convert categorical features into integers 
feature_cols_obj = [col for col in df_train.columns if df_train[col].dtypes == 'object'] 
le = preprocessing.LabelEncoder() 
for col in feature_cols_obj: 
    df_train[col] = le.fit_transform(df_train[col]) 
    df_test[col] = le.transform(df_test[col])  # note: raises if test has labels unseen in train

#Scale the data so that each feature has zero mean and unit std 
feature_cols = [col for col in df_train.columns if col not in ['Hazard','Id']] 
scaler = preprocessing.StandardScaler().fit(df_train[feature_cols]) 
df_train[feature_cols] = scaler.transform(df_train[feature_cols])        
df_test[feature_cols] = scaler.transform(df_test[feature_cols]) 

#polynomial features/interactions 
X_train = df_train[feature_cols] 
X_test = df_test[feature_cols] 
y = df_train['Hazard'] 
test_ids = df_test['Id'] 
poly = preprocessing.PolynomialFeatures(2) 
X_train = poly.fit_transform(X_train) 
X_test = poly.transform(X_test)  # transform only; poly was already fit on the training data

#do grid search to find best value for alpha 
#alphas = np.arange(-10,3,1)   
#clf = linear_model.RidgeCV(10**alphas) 
alphas = np.arange(100,10000,10)   
clf = linear_model.RidgeCV(alphas) 
clf.fit(X_train, y) 
print(clf.alpha_)
#clf.alpha=6060 

cv = cross_validation.KFold(df_train.shape[0], n_folds=10) 
mse = [] 
mse_train = [] 
fold_count = 0 
for train, test in cv: 
    print("Processing fold %s" % fold_count) 
    train_fold = df_train.iloc[train] 
    test_fold = df_train.iloc[test] 

    # Get training examples 
    X_train = train_fold[feature_cols] 
    y = train_fold['Hazard'] 
    X_test = test_fold[feature_cols] 
    #interactions 
    poly = preprocessing.PolynomialFeatures(2) 
    X_train = poly.fit_transform(X_train) 
    X_test = poly.transform(X_test)  # transform only; poly was fit on the training fold

    # Fit Ridge linear regression 
    cfr = linear_model.Ridge(alpha=6060) 
    cfr.fit(X_train, y) 

    # Check error on test set 
    pred = cfr.predict(X_test) 

    mse.append(cmetric.normalized_gini(test_fold.Hazard, pred)) 

    # Check error on training set (Resubsitution error) 
    mse_train.append(cmetric.normalized_gini(y, cfr.predict(X_train)))  

    # Done with the fold 
    fold_count += 1 

    #print model coeff 

print(cfr.coef_) 

print(pd.DataFrame(mse).mean()) 
#0.311794 
print(pd.DataFrame(mse_train).mean()) 
#0.344775 

Here is a statistical description of the data. Before the polynomial features:

   T1_V1   T1_V2   T1_V3   T1_V4   T1_V5 \ 
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000 
mean  -0.000731  -0.001736  0.000183  -0.001917  0.000392 
std  1.000116  0.999538  1.000170  1.000554  0.999491 
min  -1.687746  -1.893892  -1.256792  -1.394844  -1.330461 
25%  -0.720234  -0.934764  -0.681865  -0.978753  -1.008006 
50%  -0.139727  0.184219  -0.106938  0.685608  0.281812 
75%  0.827786  0.823638  0.467988  0.685608  1.249175 
max  1.795298  1.782766  3.342622  1.517788  1.571630 

       T1_V6   T1_V7   T1_V8   T1_V9  T1_V10 \ 
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000 
mean  0.000085  0.000574  -0.000776  0.001024  -0.000792 
std  1.000021  1.001709  0.999421  0.999460  0.999491 
min  -0.886738  -2.559151  -2.426625  -2.894427  -1.396415 
25%  -0.886738  -0.188322  -0.199566  -0.499280  -1.118270 
50%  -0.886738  -0.188322  -0.199566  -0.499280  0.272457 
75%  1.127729  -0.188322  -0.199566  0.698293  0.272457 
max  1.127729  4.553336  4.254553  3.093439  1.385038 

      ...    T2_V6   T2_V7   T2_V8   T2_V9 \ 
count  ...  45899.000000 45899.000000 45899.000000 45899.000000 
mean  ...   -0.000248  -0.002250  0.002158  -0.002376 
std  ...   1.000600  1.000546  1.009264  1.000567 
min  ...   -1.185107  -1.969111  -0.164560  -1.571220 
25%  ...   0.064723  -0.426425  -0.164560  -0.887667 
50%  ...   0.064723  0.087804  -0.164560  0.206019 
75%  ...   0.064723  1.116261  -0.164560  0.752862 
max  ...   6.313873  1.116261  10.045186  1.709837 

      T2_V10  T2_V11  T2_V12  T2_V13  T2_V14 \ 
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000 
mean  -0.000526  -0.003068  0.000881  -0.003165  -0.000713 
std  0.999744  1.001545  1.000736  1.001126  0.999412 
min  -1.843477  -1.620956  -0.472133  -1.756894  -1.151631 
25%  -0.789013  -1.620956  -0.472133  -0.488816  -0.358019 
50%  -0.261781  0.616920  -0.472133  0.779261  -0.358019 
75%  0.792683  0.616920  -0.472133  0.779261  0.435593 
max  1.319915  0.616920  2.118047  0.779261  3.610041 

      T2_V15 
count 45899.000000 
mean  -0.001722 
std  0.998565 
min  -0.807511 
25%  -0.807511 
50%  -0.482489 
75%  0.492577 
max  2.767731 

[8 rows x 32 columns] 

After the polynomial features:

  0    1    2    3    4 \ 
count 45899 45899.000000 45899.000000 45899.000000 45899.000000 
mean  1  -0.000731  -0.001736  0.000183  -0.001917 
std  0  1.000116  0.999538  1.000170  1.000554 
min  1  -1.687746  -1.893892  -1.256792  -1.394844 
25%  1  -0.720234  -0.934764  -0.681865  -0.978753 
50%  1  -0.139727  0.184219  -0.106938  0.685608 
75%  1  0.827786  0.823638  0.467988  0.685608 
max  1  1.795298  1.782766  3.342622  1.517788 

       5    6    7    8    9 \ 
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000 
mean  0.000392  0.000085  0.000574  -0.000776  0.001024 
std  0.999491  1.000021  1.001709  0.999421  0.999460 
min  -1.330461  -0.886738  -2.559151  -2.426625  -2.894427 
25%  -1.008006  -0.886738  -0.188322  -0.199566  -0.499280 
50%  0.281812  -0.886738  -0.188322  -0.199566  -0.499280 
75%  1.249175  1.127729  -0.188322  -0.199566  0.698293 
max  1.571630  1.127729  4.553336  4.254553  3.093439 

      ...    551   552   553   554 \ 
count  ...  45899.000000 45899.000000 45899.000000 45899.000000 
mean  ...   1.001451  0.231269  0.019758  -0.015785 
std  ...   1.647125  0.796845  1.026707  0.910075 
min  ...   0.222910  -3.721184  -2.439209  -1.710345 
25%  ...   0.222910  -0.367915  -0.580348  -0.386016 
50%  ...   0.222910  -0.068564  0.169033  0.227799 
75%  ...   0.222910  0.829488  0.169033  0.381252 
max  ...   4.486123  1.650512  7.646235  5.862185 

       555   556   557   558   559 \ 
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000 
mean  1.002242  -0.072864  0.006086  0.998802  -0.013314 
std  1.070157  1.007916  0.953547  1.768235  0.949678 
min  0.021090  -6.342458  -4.862610  0.128178  -3.187406 
25%  0.607248  -0.278991  -0.629262  0.128178  -0.351746 
50%  0.607248  -0.278991  -0.117269  0.189741  0.072986 
75%  0.607248  0.339440  0.394724  1.326255  0.289104 
max  3.086676  2.813165  2.156786  13.032392  9.991622 

       560 
count 45899.000000 
mean  0.997114 
std  1.573975 
min  0.024796 
25%  0.232795 
50%  0.652073 
75%  0.652073 
max  7.660336 

Here are the cv_values_ for the alphas:

clf = linear_model.RidgeCV(store_cv_values=True) 
clf.fit(X_train, y) 
print(clf.cv_values_) 
[[ 2.66305438e+00 2.66309171e+00 2.66347365e+00] 
[ 1.54423791e+00 1.54415884e+00 1.54339859e+00] 
[ 6.67823810e+00 6.67822709e+00 6.67821319e+00] 
..., 
[ 1.30064559e-02 1.30216638e-02 1.31734569e-02] 
[ 2.75705381e+01 2.75705980e+01 2.75713343e+01] 
[ 9.88136940e+00 9.88182038e+00 9.88626893e+00]] 
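As an aside, the commented-out `10**alphas` grid in the question covers several orders of magnitude, which is usually a better way to search for alpha than the linear `np.arange(100, 10000, 10)` grid. A minimal sketch of that idea on synthetic stand-in data (not the question's dataset), using a plain hold-out split and the ridge closed form instead of `RidgeCV`:

```python
import numpy as np

rng = np.random.RandomState(1)
# Synthetic stand-in data: 200 samples, 30 features, 2 informative
X = rng.randn(200, 30)
y = X[:, 0] - 2 * X[:, 1] + rng.randn(200)

# Simple hold-out split
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

best_alpha, best_mse = None, np.inf
for alpha in np.logspace(-3, 4, 8):  # 1e-3 ... 1e4, log-spaced
    # Closed-form ridge solution on the training split
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(X.shape[1]),
                        X_tr.T @ y_tr)
    mse = np.mean((X_val @ w - y_val) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print("best alpha:", best_alpha)
```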
What is the range (min and max) of the features before and after the polynomial transform? Can you show the cross-validation scores for the default range of alphas? –

@AndreasMueller I have included the information. Thanks – MAS

I was only asking for the scores when running RidgeCV with the default alphas. The feature ranges look reasonable. –

Answer

This is probably a sign of overfitting; you may want to reduce your feature set.

When you fit a regressor to your training set, some of the features end up fitting random variation within that set. When you then test out of sample (e.g., via your k-fold validation), the quality of the fit is poor, because those extra features were fitting noise rather than the central tendency. Higher alpha values help drive such coefficients toward zero, which reduces the degree of overfitting.
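The shrinkage effect can be sketched with the ridge closed form on a hypothetical toy dataset (not the question's data): as alpha grows, the norm of the coefficient vector shrinks.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data: 50 samples, 20 features, only 3 truly informative
X = rng.randn(50, 20)
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.randn(50) * 0.5

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Larger alpha -> smaller coefficient norm -> less variance, less overfitting
for alpha in [0.0, 1.0, 100.0, 6060.0]:
    w = ridge_fit(X, y, alpha)
    print("alpha=%7.1f  ||w|| = %.4f" % (alpha, np.linalg.norm(w)))
```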

You may want to prune your feature set (eliminate some columns of the input data), perhaps keeping only the ones the ridge algorithm weights heavily. Another option is to use the Lasso regressor, which drives small coefficients exactly to zero. Lasso is not a perfect solution either, however, as it is also prone to overfitting.
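For intuition on why lasso zeroes out small coefficients: under an orthonormal design, the lasso solution reduces to soft-thresholding the OLS coefficients. This is a toy sketch of that rule, not scikit-learn's coordinate-descent implementation, and the coefficients are made up for illustration.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Lasso shrinkage under an orthonormal design: move each coefficient
    toward zero by alpha, clipping the small ones to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

# Hypothetical OLS coefficients: three informative, three near-noise
w_ols = np.array([3.0, -2.0, 1.5, 0.3, -0.2, 0.05])
w_lasso = soft_threshold(w_ols, 0.5)
print(w_lasso)  # the three small coefficients are driven exactly to zero
```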