2015-05-08 61 views

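SKLearn multiclass classifier

I wrote the code below to import data vectors from a file and test the performance of an SVM classifier (using sklearn and Python).

However, the classifier performs worse than every other classifier I tried (for example, NNet reaches 98% accuracy on the test data, while the SVM reaches 92% at best). In my experience, SVM should produce better results on this kind of data.

What might I be doing wrong?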

import numpy as np 

def buildData(featureCols, testRatio): 
    f = open("car-eval-data-1.csv") 
    data = np.loadtxt(fname = f, delimiter = ',') 

    X = data[:, :featureCols] # select columns 0:featureCols-1 
    y = data[:, featureCols] # select column featureCols 

    n_points = y.size 
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets 
    split = int((1-testRatio) * n_points) 
    X_train = X[0:split,:] 
    X_test = X[split:,:] 
    y_train = y[0:split] 
    y_test = y[split:] 

    return X_train, y_train, X_test, y_test 

def buildClassifier(features_train, labels_train): 
    from sklearn import svm 

    #clf = svm.SVC(kernel='linear',C=1.0, gamma=0.1) 
    #clf = svm.SVC(kernel='poly', degree=3,C=1.0, gamma=0.1) 
    clf = svm.SVC(kernel='rbf',C=1.0, gamma=0.1) 
    clf.fit(features_train, labels_train) 
    return clf 

def checkAccuracy(clf, features, labels): 
    from sklearn.metrics import accuracy_score 

    pred = clf.predict(features) 
    accuracy = accuracy_score(labels, pred)  # argument order is (y_true, y_pred)
    return accuracy 

features_train, labels_train, features_test, labels_test = buildData(6, 0.3) 
clf   = buildClassifier(features_train, labels_train) 
trainAccuracy = checkAccuracy(clf, features_train, labels_train) 
testAccuracy = checkAccuracy(clf, features_test, labels_test) 
print("Training Items: " + str(labels_train.size) + ", Test Items: " + str(labels_test.size))
print("Training Accuracy: " + str(trainAccuracy))
print("Test Accuracy: " + str(testAccuracy))

for i in range(labels_test.size):
    # predict() expects a 2D array, so reshape the single sample to (1, n_features)
    pred = clf.predict(features_test[i].reshape(1, -1))
    print("F(" + str(i) + ") : " + str(features_test[i]) + " label= " + str(labels_test[i]) + " pred= " + str(pred))

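If SVC does not do multiclass classification by default, how can I make it do so?

P.S. My data is in the following format (the last column is the class):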

2,2,2,2,2,1,0 
2,2,2,2,1,2,0 
0,2,2,5,2,2,3 
2,2,2,4,2,2,1 
2,2,2,4,2,0,0 
2,2,2,4,2,1,1 
2,2,2,4,1,2,1 
0,2,2,5,2,2,3 

I believe sklearn by default builds a set of one-vs-all classifiers for multiclass SVM. You could also try [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) to tune the SVM hyperparameters.


Definitely use GridSearchCV to tune C and gamma. You can also scale the data with MinMaxScaler or StandardScaler.


Thanks, I will test it tomorrow. – wmac
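The suggestions in the comments (scale the features, then grid-search C and gamma) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` rather than the question's CSV, and the grid values are arbitrary starting points. In current scikit-learn releases `GridSearchCV` is imported from `sklearn.model_selection`. Note also that `SVC` accepts multiclass labels directly, so no extra wrapper is needed (the scikit-learn documentation describes its internal scheme as one-vs-one).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the question's data (6 features, 4 classes).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# Shuffled, stratified 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)

# Scaling sits inside the pipeline so it is refit on each CV training fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Assumption: illustrative grid only; sensible ranges depend on the data.
param_grid = {
    'svc__C': [0.1, 1.0, 10.0, 100.0],
    'svc__gamma': [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", best_params)
print("Test accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline means the test fold of each cross-validation round never influences the scaling statistics.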

Answer


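I found the problem a long time after posting this, so here it is in case anyone needs it.

The problem is that the data import function does not shuffle the data. If the data is sorted in some way, you risk training the classifier on one part of the data and testing it on a completely different part. In the NNet case, Matlab shuffles the input data automatically.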

import numpy as np

def buildData(filename, featureCols, testRatio):
    # np.loadtxt accepts a file path directly
    data = np.loadtxt(fname=filename, delimiter=',')
    np.random.shuffle(data)  # shuffle the rows in place to randomize the order

    X = data[:, :featureCols]  # select columns 0:featureCols-1
    y = data[:, featureCols]   # select column featureCols

    n_points = y.size
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets
    split = int((1 - testRatio) * n_points)
    X_train = X[0:split, :]
    X_test = X[split:, :]
    y_train = y[0:split]
    y_test = y[split:]

    return X_train, y_train, X_test, y_test
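
To see why the ordering matters, and as one possible alternative to shuffling by hand, here is a small sketch that uses scikit-learn's `train_test_split`. The data is a made-up stand-in (100 rows sorted by class, 25 per class, arbitrary feature values), not the original CSV, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for a file whose rows are sorted by class.
rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2, 3], 25)            # labels 0,0,...,1,1,...,3,3
X = rng.randint(0, 6, size=(y.size, 6))    # arbitrary feature values

# Unshuffled 70/30 split, as in the question's import function.
split = int((1 - 0.3) * y.size)
classes_train_sorted = np.unique(y[:split]).tolist()
classes_test_sorted = np.unique(y[split:]).tolist()
print("Unshuffled split, classes in train:", classes_train_sorted)
print("Unshuffled split, classes in test: ", classes_test_sorted)

# Shuffled, stratified split: every class keeps its share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
classes_train = np.unique(y_train).tolist()
classes_test = np.unique(y_test).tolist()
print("Shuffled split, classes in train:", classes_train)
print("Shuffled split, classes in test: ", classes_test)
```

With the sorted data, the last class appears only in the test part, so the classifier is never trained on it. Passing `stratify=y` is optional, but it keeps the class proportions similar in both parts, which can help when some classes are rare.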