2015-04-16 131 views
6

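How to plot a ROC curve with scikit-learn for the multiclass case?

I want to plot ROC curves for the multiclass case on my own dataset. From the documentation I read that the labels must be binarized (I have 5 labels, from 1 to 5), so I followed the example provided in the documentation: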

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_curve, auc
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# TF-IDF over word bigrams
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2, 2))

df = pd.read_csv('path/file.csv', 
        header=0, sep=',', names=['id', 'content', 'label']) 


X = tfidf_vect.fit_transform(df['content'].values) 
y = df['label'].values 




# Binarize the output 
y = label_binarize(y, classes=[1,2,3,4,5]) 
n_classes = y.shape[1] 

# Add noisy features to make the problem harder 
random_state = np.random.RandomState(0) 
n_samples, n_features = X.shape 
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)] 

# shuffle and split training and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33 
                ,random_state=0) 

# Learn to predict each class against the other 
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, 
           random_state=random_state)) 
y_score = classifier.fit(X_train, y_train).decision_function(X_test) 

# Compute ROC curve and ROC area for each class 
fpr = dict() 
tpr = dict() 
roc_auc = dict() 
for i in range(n_classes): 
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i]) 
    roc_auc[i] = auc(fpr[i], tpr[i]) 

# Compute micro-average ROC curve and ROC area 
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel()) 
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"]) 

# Plot of a ROC curve for a specific class 
plt.figure() 
plt.plot(fpr[2], tpr[2], label='ROC curve (area = %0.2f)' % roc_auc[2]) 
plt.plot([0, 1], [0, 1], 'k--') 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.05]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate') 
plt.title('Receiver operating characteristic example') 
plt.legend(loc="lower right") 
plt.show() 

# Plot ROC curve 
plt.figure() 
plt.plot(fpr["micro"], tpr["micro"], 
     label='micro-average ROC curve (area = {0:0.2f})' 
       ''.format(roc_auc["micro"])) 
for i in range(n_classes): 
    plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})' 
            ''.format(i, roc_auc[i])) 

plt.plot([0, 1], [0, 1], 'k--') 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.05]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate') 
plt.title('Some extension of Receiver operating characteristic to multi-class') 
plt.legend(loc="lower right") 
plt.show() 

The problem is that this approach never finishes. Any idea how to plot the ROC curve for this dataset?

+3

I think you have a conceptual error. ROC is really undefined for anything other than two classes. – carlosdc

+0

Thanks for the feedback @carlosdc. Sure, it only applies to the binary classification case. So this is not possible? –

+1

You can do pairwise ROC curves, one for each pair of classes. – Scott
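A minimal sketch of what the pairwise idea could look like. It is not from the thread: the synthetic data from `make_classification` stands in for the real dataset, and `LogisticRegression` is an arbitrary choice of binary classifier. For each pair of classes, only the samples belonging to that pair are kept, a binary model is fitted, and an ordinary two-class ROC curve is computed.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Assumption: synthetic 5-class data in place of the real dataset.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
y = y + 1  # labels 1..5, as in the question

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

pair_curves = {}
for a, b in combinations([1, 2, 3, 4, 5], 2):
    # Keep only the samples that belong to class a or class b.
    train_mask = np.isin(y_train, [a, b])
    test_mask = np.isin(y_test, [a, b])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[train_mask], y_train[train_mask])

    # For a binary model, positive scores favor clf.classes_[1], which is b here.
    scores = clf.decision_function(X_test[test_mask])
    fpr, tpr, _ = roc_curve(y_test[test_mask], scores, pos_label=b)
    pair_curves[(a, b)] = (fpr, tpr, auc(fpr, tpr))

for pair, (_, _, area) in pair_curves.items():
    print(pair, round(area, 3))
```

Each entry of `pair_curves` can then be passed to `plt.plot(fpr, tpr, ...)` in the same way as the per-class curves in the question.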

Answers

3

This version never finishes because of this line:

classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=random_state)) 

The SVM classifier takes a very long time to finish. Use a different classifier, such as AdaBoost or another of your choice:

classifier = OneVsRestClassifier(AdaBoostClassifier()) 

Remember to add the import:

from sklearn.ensemble import AdaBoostClassifier 

Delete this code, it is useless:

# Add noisy features to make the problem harder 
random_state = np.random.RandomState(0) 
n_samples, n_features = X.shape 
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)] 

And instead just add:

random_state = 0 
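
Put together, the changes above might look roughly like the sketch below. It is only an illustration under assumptions: `make_classification` data replaces the TF-IDF matrix and labels from the question, and plotting is left out so the snippet stays short. The noisy-feature block is gone and `AdaBoostClassifier` replaces the SVM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Assumption: synthetic data stands in for X = tfidf_vect.fit_transform(...)
# and y = df['label'].values from the question.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
y = y + 1  # shift labels from 0..4 to 1..5

classes = [1, 2, 3, 4, 5]
y_bin = label_binarize(y, classes=classes)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.33, random_state=0)

# One binary AdaBoost model per class (one-vs-rest).
classifier = OneVsRestClassifier(AdaBoostClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Per-class ROC curves and areas.
fpr, tpr, roc_auc = {}, {}, {}
for i, label in enumerate(classes):
    fpr[label], tpr[label], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[label] = auc(fpr[label], tpr[label])

# Micro-average: pool every (sample, class) decision into one binary problem.
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

for key in classes + ["micro"]:
    print(key, round(roc_auc[key], 3))
```

The `fpr`, `tpr` and `roc_auc` dictionaries have the same layout as in the question, except that they are keyed by class label rather than column index, so the plotting loops only need to iterate over `classes`.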
+0

Thanks for the help. Why does this happen with SVMs? –

+3

This is because you set probability to True. In that case the SVM also has to compute probabilities, which is memory- and compute-intensive. – Salamander
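
Related to this comment: the question's code only calls `decision_function`, which `SVC` provides without `probability=True`, so dropping that flag may be another option. The sketch below is an untimed illustration on synthetic data (an assumption, not the question's dataset); whether it is fast enough on a large TF-IDF matrix is not something the thread establishes.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Assumption: small synthetic 5-class problem in place of the real data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
y_bin = label_binarize(y + 1, classes=[1, 2, 3, 4, 5])

# No probability=True: decision_function scores are enough for roc_curve.
classifier = OneVsRestClassifier(SVC(kernel="linear", random_state=0))
y_score = classifier.fit(X, y_bin).decision_function(X)

print(y_score.shape)  # one column of scores per class
```

`roc_curve` accepts any monotonic score, so these margins can be used in place of probabilities.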

+0

@Eranyogev How would you plot this for the multiclass case with cross-validation? – Bambi
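
This question goes unanswered in the thread. One possible approach, offered only as a sketch: collect out-of-fold scores with `cross_val_predict` and compute the ROC curves from those. The synthetic data and the `LogisticRegression` base estimator are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Assumption: synthetic 5-class data in place of the real dataset.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
y = y + 1
classes = [1, 2, 3, 4, 5]

# Every sample is scored by a model that did not see it during training.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
y_score = cross_val_predict(clf, X, y, cv=5, method="decision_function")

# Binarize only for the ROC computation; the classifier sees the raw labels.
y_bin = label_binarize(y, classes=classes)

cv_auc = {}
for i, label in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    cv_auc[label] = auc(fpr, tpr)

print({label: round(area, 3) for label, area in cv_auc.items()})
```

This pools the predictions from all folds into one curve per class. Plotting a separate curve per fold would need an explicit loop over the splits instead.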