2013-07-07

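scikit 0.14 multilabel metrics

I just installed scikit 0.14 to explore the improvements to its multilabel metrics. I got some positive results with the Hamming loss metric and the classification report, but I could not get the confusion matrix to work. Also, in the classification report, I could not pass in an array of labels and have those labels printed in the report. The code is below. Am I doing something wrong, or is this still under development?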

import numpy as np 
import pandas as pd 
import random 

from sklearn import datasets 
from sklearn.pipeline import Pipeline 
from sklearn.multiclass import OneVsOneClassifier 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer 

target_names = ['New York','London', 'DC'] 

X_train = np.array(["new york is a hell of a town", 
        "new york was originally dutch", 
        "the big apple is great", 
        "new york is also called the big apple", 
        "nyc is nice", 
        "people abbreviate new york city as nyc", 
        "the capital of great britain is london", 
        "london is in the uk", 
        "london is in england", 
        "london is in great britain", 
        "it rains a lot in london", 
        "london hosts the british museum", 
        "new york is great and so is london", 
        "i like london better than new york", 
        "DC is the nations capital", 
        "DC the home of the beltway", 
        "president obama lives in Washington", 
        "The washington monument in is Washington DC"]) 

# Each entry lists label indices into target_names; a sample may carry more than one label.
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[1,0],[1,0],[2],[2],[2],[2]] 


X_test = np.array(['nice day in nyc', 
        'welcome to london', 
        'hello welcome to new york. enjoy it here and london too', 
        'What city does the washington redskins live in?']) 
y_test = [[0],[1],[0,1],[2]]     

classifier = Pipeline([ 
         ('vectorizer', CountVectorizer(stop_words='english', 
          ngram_range=(1,3), 
          max_df = 1.0, 
          min_df = 0.1, 
          analyzer='word')), 
         ('tfidf', TfidfTransformer()), 
         ('clf', OneVsRestClassifier(LinearSVC()))]) 

classifier.fit(X_train, y_train) 

predicted = classifier.predict(X_test) 

print predicted 


for item, labels in zip(X_test, predicted): 
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels)) 



from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report 
from sklearn.metrics import hamming_loss 



hl = hamming_loss(y_test, predicted, target_names) 
print " " 
print " " 
print "---------------------------------------------------------" 
print "HAMMING LOSS" 
print " " 
print hl 

print " " 
print " " 
print "---------------------------------------------------------" 
print "CONFUSION MATRIX" 
print " " 
cm = confusion_matrix(y_test, predicted) 
print cm 

print " " 
print " " 
print "---------------------------------------------------------" 
print "CLASSIFICATION REPORT" 
print " " 
print classification_report(y_test, predicted) 

Answer


Multiclass and multilabel metric support appears to have been improved in version 0.14, released on August 14, 2013 - scikit-learn.org/stable/whats_new.html
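
To make concrete what a multilabel metric such as Hamming loss measures, here is a minimal sketch in plain Python 3. It is not scikit-learn's implementation: the function name `hamming_loss_from_label_lists` and the `y_pred` values are made up for illustration, and only `y_test` comes from the question.

```python
def hamming_loss_from_label_lists(y_true, y_pred, n_labels):
    """Fraction of (sample, label) slots where truth and prediction disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of samples")
    wrong_slots = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        # The symmetric difference holds labels present in exactly one of the two sets.
        wrong_slots += len(set(true_labels) ^ set(pred_labels))
    return wrong_slots / (len(y_true) * n_labels)


# y_test is taken from the question; y_pred is hypothetical.
y_test = [[0], [1], [0, 1], [2]]
y_pred = [[0], [1], [1], [2]]

loss = hamming_loss_from_label_lists(y_test, y_pred, n_labels=3)
print(loss)
```

With these hypothetical predictions, the only mistake is the missing label 0 on the third sample, so one slot out of 4 x 3 = 12 is wrong and the loss is about 0.083.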

In addition, issue 558 appears to address some of this as well, possibly in 0.14, though I have not confirmed this - https://github.com/scikit-learn/scikit-learn/issues/558
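
A single confusion matrix assumes exactly one label per sample, which is probably why `confusion_matrix` does not work on the multilabel data in the question. One common workaround is to build a separate 2x2 table per label, treating each label as its own binary task. The following is a minimal sketch in plain Python 3 under that assumption. The helper `per_label_confusion` and the `y_pred` values are hypothetical, while `target_names` and `y_test` come from the question. As far as I know, later scikit-learn releases added `multilabel_confusion_matrix` for this purpose, but that function is not available in 0.14.

```python
def per_label_confusion(y_true, y_pred, label_names):
    """Return {label name: [[tn, fp], [fn, tp]]}, one binary table per label."""
    tables = {}
    for index, name in enumerate(label_names):
        tn = fp = fn = tp = 0
        for true_labels, pred_labels in zip(y_true, y_pred):
            is_true = index in true_labels
            is_pred = index in pred_labels
            if is_true and is_pred:
                tp += 1
            elif is_true:
                fn += 1
            elif is_pred:
                fp += 1
            else:
                tn += 1
        # Rows are the true class (absent, present); columns are the predicted class.
        tables[name] = [[tn, fp], [fn, tp]]
    return tables


target_names = ['New York', 'London', 'DC']  # from the question
y_test = [[0], [1], [0, 1], [2]]             # from the question
y_pred = [[0], [1], [1], [2]]                # hypothetical predictions

tables = per_label_confusion(y_test, y_pred, target_names)
for name in target_names:
    print(name, tables[name])
```

Keying the tables by name rather than by index is one way to get readable labels into the output, which is the same thing the question wants from the classification report.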