
I want to use cross-validation to train/test my dataset and to evaluate the performance of a logistic regression model on the entire dataset, not just on a single test split (e.g. 25%).

These concepts are completely new to me and I am not sure whether I am doing this correctly. I would appreciate any advice on where I have gone wrong and what the right steps are. Part of my code is shown below.

Also, how can I plot the ROC curves for 'y2' and 'y3' on the same figure as the current one?

Thanks

import pandas as pd 
Data=pd.read_csv('C:\\Dataset.csv', index_col='SNo')
feature_cols=['A','B','C','D','E'] 
X=Data[feature_cols] 

y=Data['Status']
Y1=Data['Status1'] # predictions from elsewhere 
Y2=Data['Status2'] # predictions from elsewhere 

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train, y_train)

from sklearn import metrics, cross_validation 
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10) 
metrics.accuracy_score(y, predicted) 

from sklearn.cross_validation import cross_val_score 
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy') 
print (accuracy) 
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean()) 

from nltk import ConfusionMatrix 
print (ConfusionMatrix(list(y), list(predicted))) 
#print (ConfusionMatrix(list(y), list(yexpert))) 

# sensitivity: 
print (metrics.recall_score(y, predicted)) 

import matplotlib.pyplot as plt 
probs = logreg.predict_proba(X)[:, 1] 
plt.hist(probs) 
plt.show() 

# use a 0.5 cutoff to turn the predicted probabilities into class predictions
import numpy as np 
preds = np.where(probs > 0.5, 1, 0) 
print (ConfusionMatrix(list(y), list(preds))) 

# check accuracy, sensitivity, specificity 
print (metrics.accuracy_score(y, predicted)) 

#ROC CURVES and AUC 
# plot ROC curve 
fpr, tpr, thresholds = metrics.roc_curve(y, probs) 
plt.plot(fpr, tpr) 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.0]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate')
plt.show() 

# calculate AUC 
print (metrics.roc_auc_score(y, probs)) 

# use AUC as evaluation metric for cross-validation 
from sklearn.cross_validation import cross_val_score 
logreg = LogisticRegression() 
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean() 

Answer


You have got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset; you only need to remove logreg.fit from your code. Specifically, it does the following: it splits your dataset into n folds, and in each iteration it holds out one of the folds as the test set and trains the model on the remaining n-1 folds. So, in the end, you get predictions for your entire data.
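To make that concrete, here is a minimal sketch of what cross_val_predict does under the hood, written with KFold from sklearn.model_selection (the module that replaced cross_validation in newer sklearn releases); the actual implementation chooses a stratified splitter for classifiers, but the idea is the same:

import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

iris = load_iris()
X, y = iris['data'], iris['target']

predicted = np.empty_like(y)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()                      # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])             # train on the other 9 folds
    predicted[test_idx] = model.predict(X[test_idx])  # predict only the held-out fold

# every sample ends up with exactly one out-of-fold prediction
print(metrics.accuracy_score(y, predicted))

This is also why no separate logreg.fit call is needed: each fold's model is fitted inside the loop, and every prediction you score against y comes from a model that never saw that sample during training.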

Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features; iris['data'] is X and iris['target'] is y.

In [15]: iris['data'].shape 
Out[15]: (150, 4) 

With cross-validation over the entire set, you can get the predictions as follows:

from sklearn.linear_model import LogisticRegression 
from sklearn import metrics, cross_validation 
from sklearn import datasets 
iris = datasets.load_iris() 
predicted = cross_validation.cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10) 
print(metrics.accuracy_score(iris['target'], predicted))

Out [1] : 0.9537 

print(metrics.classification_report(iris['target'], predicted))

Out [2] : 
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.96      0.90      0.93        50
          2       0.91      0.96      0.93        50

avg / total       0.95      0.95      0.95       150

So, coming back to your code, all you need is something like this:

from sklearn import metrics, cross_validation 
logreg=LogisticRegression() 
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10) 
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))

For plotting ROC curves in the multi-class case, you can follow this tutorial, which shows how to draw one ROC curve per class on a single figure.
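If your own Status target is binary, as the 0.5 cutoff in your code suggests, putting several ROC curves on the same chart only requires calling plt.plot more than once before plt.show. Below is a sketch under the assumption that y, probs, Y1, and Y2 are the variables from your code and that Y1/Y2 hold scores or 0/1 predictions for the same samples (hard 0/1 predictions will produce a one-point curve, so scores or probabilities are preferable):

import matplotlib.pyplot as plt
from sklearn import metrics

plt.figure()
for label, scores in [('logreg', probs), ('Y1', Y1), ('Y2', Y2)]:
    fpr, tpr, _ = metrics.roc_curve(y, scores)        # one ROC curve per set of scores
    auc = metrics.roc_auc_score(y, scores)
    plt.plot(fpr, tpr, label='%s (AUC = %.3f)' % (label, auc))

plt.plot([0, 1], [0, 1], linestyle='--')              # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()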

In general, sklearn has very good tutorials and documentation; I strongly recommend reading their tutorial on cross_validation.