ValueError when running SVM in sklearn

I'm using numpy arrays for a support vector machine problem, as follows.

import numpy as np 
from sklearn import svm

I have 3 classes/labels (male, female, na), represented as follows:

labels = [0,1,2]

Each class is defined by 3 variables (height, weight, age) that serve as training data:

male_height = np.array([111,121,137,143,157]) 
male_weight = np.array([60,70,88,99,75]) 
male_age = np.array([41,32,73,54,35]) 

males = np.hstack([male_height,male_weight,male_age]) 

female_height = np.array([91,121,135,98,90]) 
female_weight = np.array([32,67,98,86,56]) 
female_age = np.array([51,35,33,67,61]) 

females = np.hstack([female_height,female_weight,female_age]) 

na_height = np.array([96,127,145,99,91]) 
na_weight = np.array([42,97,78,76,86]) 
na_age = np.array([56,35,49,64,66]) 

nas = np.hstack([na_height,na_weight,na_age])

Now I need to fit an SVM to the training data so that it predicts the class for a given set of the three variables:

height_weight_age = [100,100,100] 

clf = svm.SVC() 
trainingData = np.vstack([males,females,nas]) 

clf.fit(trainingData, labels) 

result = clf.predict(height_weight_age) 

print result

Unfortunately, the following error occurs:

ValueError: X.shape[1] = 3 should be equal to 15, the number of features at training time

How should I modify trainingData and labels to get the right answer?

2014-10-12 jean

@jonrsharpe Thanks for editing my original question, nice! – jean 2014-10-12 14:13:45

hstack gives you a 1-D array. You need a 2-D array of shape (n_samples, n_features), which you can get from vstack followed by a transpose.

In [7]: males = np.hstack([male_height,male_weight,male_age]) 

In [8]: males 
Out[8]: 
array([111, 121, 137, 143, 157,  60,  70,  88,  99,  75,  41,  32,  73,
        54,  35])

In [9]: np.vstack([male_height,male_weight,male_age]) 
Out[9]: 
array([[111, 121, 137, 143, 157],
       [ 60,  70,  88,  99,  75],
       [ 41,  32,  73,  54,  35]])

In [10]: np.vstack([male_height,male_weight,male_age]).T 
Out[10]: 
array([[111,  60,  41],
       [121,  70,  32],
       [137,  88,  73],
       [143,  99,  54],
       [157,  75,  35]])
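
As a compact alternative to the transpose above, np.column_stack builds the same (n_samples, n_features) matrix in one call. This is only a minimal sketch that reuses the question's arrays; the equality check is there purely for illustration.

```python
import numpy as np

male_height = np.array([111, 121, 137, 143, 157])
male_weight = np.array([60, 70, 88, 99, 75])
male_age = np.array([41, 32, 73, 54, 35])

# Each 1-D array becomes one column, so every row is one sample.
males = np.column_stack([male_height, male_weight, male_age])

print(males.shape)  # (5, 3): 5 samples, 3 features

# Same result as vstack followed by a transpose.
same_as_transpose = np.array_equal(
    males, np.vstack([male_height, male_weight, male_age]).T
)
print(same_as_transpose)  # True
```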

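You also need to pass a list/array of labels that gives the label of each sample, not just an enumeration of the labels that exist. After rebuilding males, females and nas as 2-D arrays in this way, I can train the SVM and apply it as follows.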

In [19]: clf = svm.SVC() 

In [20]: y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5 

In [21]: X = np.vstack([males, females, nas]) 

In [22]: clf.fit(X, y) 
Out[22]: 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, 
    kernel='rbf', max_iter=-1, probability=False, random_state=None, 
    shrinking=True, tol=0.001, verbose=False) 

In [23]: height_weight_age = [100,100,100] 

In [24]: clf.predict(height_weight_age) 
Out[24]: 
array(['female'], 
     dtype='|S6')

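(Note that I used string labels instead of numbers. I'd also advise you to standardize the feature values, since they have quite different ranges.)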
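
For readers on a recent scikit-learn release, the session above might be put together as a single script along the following lines. This is a sketch under assumptions: stack_features is a hypothetical helper introduced only for this example, and predict is given a 2-D input because newer versions reject a bare 1-D sample. Defaults such as gamma have changed since 2014, so the predicted label is not guaranteed to match the output shown above.

```python
import numpy as np
from sklearn import svm


def stack_features(height, weight, age):
    # Hypothetical helper: one row per sample, one column per feature.
    return np.column_stack([height, weight, age])


males = stack_features([111, 121, 137, 143, 157],
                       [60, 70, 88, 99, 75],
                       [41, 32, 73, 54, 35])
females = stack_features([91, 121, 135, 98, 90],
                         [32, 67, 98, 86, 56],
                         [51, 35, 33, 67, 61])
nas = stack_features([96, 127, 145, 99, 91],
                     [42, 97, 78, 76, 86],
                     [56, 35, 49, 64, 66])

X = np.vstack([males, females, nas])            # shape (15, 3)
y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5  # one label per row of X

clf = svm.SVC()
clf.fit(X, y)

# A single sample is wrapped in an outer list so its shape is (1, 3),
# matching the 3 features seen at training time.
height_weight_age = [[100, 100, 100]]
result = clf.predict(height_weight_age)
print(result)
```

The key point is that X has 15 rows and y has 15 entries, so each training sample is paired with exactly one label.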

2014-10-12 14:21:50

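I always thought I only needed to standardize when the feature values differ by several orders of magnitude... do you find it helps even in a case like this, where they differ only by a (smallish) constant factor? – Fred 2014-10-12 16:08:51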

@Fred It's certainly worth a try. It also changes the scale of the `C` (regularization) parameter, making it easier to work with. – 2014-10-12 16:42:39

@Fred What is standardization, and how do I do it? @larsmans – jean 2014-10-13 01:13:30
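
Standardization usually means rescaling each feature to zero mean and unit variance. One possible way to do it, sketched here with scikit-learn's StandardScaler chained to the classifier through make_pipeline (this pipeline is an illustration, not something the answer itself used), is:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rows are samples (height, weight, age), taken from the question.
X = np.array([
    [111, 60, 41], [121, 70, 32], [137, 88, 73], [143, 99, 54], [157, 75, 35],  # male
    [91, 32, 51], [121, 67, 35], [135, 98, 33], [98, 86, 67], [90, 56, 61],     # female
    [96, 42, 56], [127, 97, 35], [145, 78, 49], [99, 76, 64], [91, 86, 66],     # na
], dtype=float)
y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5

# The scaler learns per-feature mean and standard deviation during fit
# and applies the same transformation again at prediction time.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)

prediction = model.predict([[100, 100, 100]])
print(prediction)

# After scaling, every column has mean ~0 and standard deviation ~1.
X_scaled = model.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Keeping the scaler inside the pipeline means new samples are scaled with the training statistics automatically, rather than by hand.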
