2014-02-17 195 views

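How do I plot a learning curve for logistic regression?

I'm running logistic regression and would like to plot a learning curve to get a feel for the data. How can I do this? Here is my code so far: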

import numpy as np
import pandas as p
import sklearn.linear_model as lm
from sklearn import metrics, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')

print("loading data..")
# Column 2 holds the text; the last column of train.tsv holds the label.
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
# Cast to int so scikit-learn sees a numeric label array (assumes integer class labels).
y = np.array(p.read_table('train.tsv'))[:, -1].astype(int)

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=True, smooth_idf=True, sublinear_tf=True)

# dual=True is only supported by the liblinear solver.
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None,
                           solver='liblinear')

X_all = traindata + testdata
lentrain = len(traindata)

print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)

X = X_all[:lentrain]
X_test = X_all[lentrain:]

print("20 Fold CV Score: ", np.mean(cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))

print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")

What I want to produce is something like this, so that I can get a better understanding of what is going on:

Image of expected output

Could anyone help me with this, please?

Answers


It's not as general as it should be, but it will get the job done with a little fiddling on your end.

from matplotlib import pyplot as plt
from sklearn import metrics
import numpy as np

def data_size_response(model, trX, teX, trY, teY, score_func, prob=True, n_subsets=20):
    """Fit the model on growing subsets of the training data and score each fit."""
    train_errs, test_errs = [], []
    # Exponentially spaced subset sizes, from about e^3 (20 rows) up to the full training set.
    subset_sizes = np.exp(np.linspace(3, np.log(trX.shape[0]), n_subsets)).astype(int)

    for m in subset_sizes:
        model.fit(trX[:m], trY[:m])
        if prob:
            # Score on predicted probabilities.
            train_err = score_func(trY[:m], model.predict_proba(trX[:m]))
            test_err = score_func(teY, model.predict_proba(teX))
        else:
            # Score on hard class predictions.
            train_err = score_func(trY[:m], model.predict(trX[:m]))
            test_err = score_func(teY, model.predict(teX))
        print("training error: %.3f test error: %.3f subset size: %d" % (train_err, test_err, m))
        train_errs.append(train_err)
        test_errs.append(test_err)

    return subset_sizes, train_errs, test_errs

def plot_response(subset_sizes, train_errs, test_errs):
    """Plot training and test error against subset size on a log-scaled x axis."""
    plt.plot(subset_sizes, train_errs, lw=2)
    plt.plot(subset_sizes, test_errs, lw=2)
    plt.legend(['Training Error', 'Test Error'])
    plt.xscale('log')
    plt.xlabel('Dataset size')
    plt.ylabel('Error')
    plt.title('Model response to dataset size')
    plt.show()

model = ...       # put your model here
score_func = ...  # put your scoring function here
# trX, teX, trY, teY are your pre-split train/test arrays
response = data_size_response(model, trX, teX, trY, teY, score_func, prob=True)
plot_response(*response)

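The data_size_response function takes a model (in your case, an instantiated LR model), a pre-split dataset (train/test X and Y arrays, which you can generate with the train_test_split function in sklearn), and a scoring function as input. It iterates over your dataset, training on n exponentially spaced subsets, and returns the "learning curve". There is also a plotting function for visualizing this response.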
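As a rough illustration of how the three inputs might be prepared, here is a minimal sketch. It is built on assumptions: the data is synthetic (make_classification stands in for the real feature matrix), the model settings are arbitrary, and auc_from_proba is a hypothetical helper, not part of sklearn. The helper exists because, for a binary problem, roc_auc_score expects a one-dimensional score array, while predict_proba returns one column per class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, not the asker's TF-IDF matrix).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pre-split arrays in the order data_size_response expects: trX, teX, trY, teY.
trX, teX, trY, teY = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)

def auc_from_proba(y_true, proba):
    """Hypothetical helper: binary ROC AUC from predict_proba output."""
    # predict_proba returns one column per class; keep the positive class.
    return roc_auc_score(y_true, proba[:, 1])

# The same exponential spacing used inside data_size_response.
n_subsets = 20
subset_sizes = np.exp(np.linspace(3, np.log(trX.shape[0]), n_subsets)).astype(int)
print(subset_sizes)

# One iteration of the loop, on the smallest subset.
m = subset_sizes[0]
model.fit(trX[:m], trY[:m])
print("train AUC: %.3f  test AUC: %.3f" % (
    auc_from_proba(trY[:m], model.predict_proba(trX[:m])),
    auc_from_proba(teY, model.predict_proba(teX)),
))
```

With these pieces in place, the call would look like data_size_response(model, trX, teX, trY, teY, auc_from_proba, prob=True). Since the function always takes the first m rows, the training data should be in random order; train_test_split shuffles by default. Note that AUC is a score (higher is better), so the "Error" labels in the plot would need adjusting.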

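I would have liked to use cross_val_score as in your example, but that would require modifying the sklearn source to get back training scores in addition to the test scores it already provides. The prob argument switches between the predict_proba and predict methods on the model, which is necessary for certain model/scoring function combinations, e.g. roc_auc_score.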
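For readers on a recent scikit-learn release, the learning_curve function in sklearn.model_selection may cover the cross-validated case, since it returns both training and validation scores. This is a different technique from the function above (linearly spaced sizes, cross-validation instead of a fixed test split), and the sketch below assumes synthetic data and arbitrary settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each row of the score arrays is one training size; each column is one CV fold.
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # fractions of the per-fold training set
    cv=5,
    scoring="roc_auc",
)

# Average over folds to get one curve for training and one for validation.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print("size: %4d  train AUC: %.3f  CV AUC: %.3f" % (n, tr, te))
```

The resulting train_sizes, train_mean and test_mean can be passed to a plotting function such as plot_response. The values here are AUC scores, so higher is better.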

Example plot on a subset of the MNIST dataset: [image]

Let me know if you have any questions!