2017-01-29 1324 views
13

LogisticRegression: Unknown label type: 'continuous' using sklearn in python

I have the following code to test some of the most popular ML algorithms in the sklearn Python library:

import numpy as np
from sklearn import metrics, svm
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

trainingData = np.array([ [2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0] ]) 
trainingScores = np.array([3.4, 7.5, 4.5, 1.6]) 
predictionData = np.array([ [2.5, 2.4, 2.7], [2.7, 3.2, 1.2] ]) 

clf = LinearRegression() 
clf.fit(trainingData, trainingScores) 
print("LinearRegression") 
print(clf.predict(predictionData)) 

clf = svm.SVR() 
clf.fit(trainingData, trainingScores) 
print("SVR") 
print(clf.predict(predictionData)) 

clf = LogisticRegression() 
clf.fit(trainingData, trainingScores) 
print("LogisticRegression") 
print(clf.predict(predictionData)) 

clf = DecisionTreeClassifier() 
clf.fit(trainingData, trainingScores) 
print("DecisionTreeClassifier") 
print(clf.predict(predictionData)) 

clf = KNeighborsClassifier() 
clf.fit(trainingData, trainingScores) 
print("KNeighborsClassifier") 
print(clf.predict(predictionData)) 

clf = LinearDiscriminantAnalysis() 
clf.fit(trainingData, trainingScores) 
print("LinearDiscriminantAnalysis") 
print(clf.predict(predictionData)) 

clf = GaussianNB() 
clf.fit(trainingData, trainingScores) 
print("GaussianNB") 
print(clf.predict(predictionData)) 

clf = SVC() 
clf.fit(trainingData, trainingScores) 
print("SVC") 
print(clf.predict(predictionData)) 

The first two work fine, but I get the following error on the LogisticRegression call:

[email protected]:/home/ouhma# python stack.py 
LinearRegression 
[ 15.72023529 6.46666667] 
SVR 
[ 3.95570063 4.23426243] 
Traceback (most recent call last):
  File "stack.py", line 28, in <module>
    clf.fit(trainingData, trainingScores)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/logistic.py", line 1174, in fit
    check_classification_targets(y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

The input data is the same as in the previous calls, so what is going on here?

By the way, why is there such a huge difference between the first predictions of the LinearRegression() and SVR() algorithms (15.72 vs 3.95)?

Answers

19

You are passing floats to a classifier which expects categorical values as the target vector. If you convert the scores to int they will be accepted as input (although it is questionable whether that is the right way to do it).

It would be better to convert your training scores by using scikit's LabelEncoder.

The same goes for your DecisionTree and KNeighbors classifiers.

from sklearn import preprocessing
from sklearn import utils

# trainingScores is the float target array defined in the question
lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)
# encoded is now array([1, 3, 2, 0], dtype=int64)

print(utils.multiclass.type_of_target(trainingScores))
# continuous

print(utils.multiclass.type_of_target(trainingScores.astype('int')))
# multiclass

print(utils.multiclass.type_of_target(encoded))
# multiclass
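As a follow-up, here is a minimal sketch (assuming trainingData, predictionData, lab_enc and encoded from the snippets above) of fitting one of the failing classifiers on the encoded targets:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(trainingData, encoded)                # no "Unknown label type" error now
predictions = clf.predict(predictionData)     # predicted labels are encoded class ids
print(lab_enc.inverse_transform(predictions)) # map the ids back to the original float scores

Note that this simply treats each distinct training score as its own class, which is exactly the caveat above about whether classification is the right tool for continuous targets.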
+1

Thanks! So I have to convert '2.3' into '23' and so on, right? Is there an elegant way to do that conversion with numpy or pandas? – harrison4

+1

But in this example the input data has floats and it uses the LogisticRegression function: http://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/ ... and it works fine. Why? – harrison4

+0

The input can be floats, but the output needs to be categorical, i.e. int. In that example, the 8th column contains only 0 or 1. Usually you have categorical labels such as ['red', 'big', 'sick'] and you need to convert them to numerical values. Try http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features or http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html –
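A minimal sketch of that kind of conversion, using the illustrative ['red', 'big', 'sick'] labels from the comment above:

from sklearn import preprocessing

labels = ['red', 'big', 'sick', 'red', 'sick']  # hypothetical categorical labels
lab_enc = preprocessing.LabelEncoder()
numeric = lab_enc.fit_transform(labels)
print(numeric)           # [1 0 2 1 2]  (classes are ordered alphabetically)
print(lab_enc.classes_)  # ['big' 'red' 'sick']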

3

I ran into the same problem when trying to feed floats to a classifier. I wanted to keep the accuracy of the floats rather than rounding to integers. Try using regressor algorithms instead. For example:

import numpy as np 
from sklearn import linear_model 
from sklearn import svm 

# regression models (despite the variable name) that accept continuous targets
classifiers = [
    svm.SVR(),
    linear_model.SGDRegressor(),
    linear_model.BayesianRidge(),
    linear_model.LassoLars(),
    linear_model.ARDRegression(),
    linear_model.PassiveAggressiveRegressor(),
    linear_model.TheilSenRegressor(),
    linear_model.LinearRegression()]

trainingData = np.array([ [2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0] ]) 
trainingScores = np.array([3.4, 7.5, 4.5, 1.6]) 
predictionData = np.array([ [2.5, 2.4, 2.7], [2.7, 3.2, 1.2] ]) 

for item in classifiers: 
    print(item) 
    clf = item 
    clf.fit(trainingData, trainingScores) 
    print(clf.predict(predictionData),'\n')