2017-03-07 174 views
0

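scikit-learn LinearRegression with a string column as the predicted value

After working through some courses and examples in tutorials, I am trying to build my first machine learning model. I got the training data from here: https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv, and I am using pandas to load this CSV data.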

The main problem is that the column to predict contains strings, while all the algorithms work with floats.

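Of course I could map every string to a number (0, 1, 2) by hand and use the modified file, but I am trying to find a way to replace the string values automatically with pandas or scikit-learn and keep the mapping in a separate array.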

My code is:

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv")

data.head()

features_cols = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
X = data[features_cols]  # the DataFrame is named data, not df
y = data.Name  # string labels such as 'Iris-setosa'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression()
linreg.fit(X_train, y_train)  # fails here because y_train holds strings

The error I see is:

ValueError: could not convert string to float: 'Iris-setosa' 

How can I use pandas to replace all the values in the "Name" column with integers?

Answers

1

You can use scikit-learn's LabelEncoder:

>>> import pandas as pd
>>> from sklearn import preprocessing 
>>> df = pd.DataFrame({'Name':['Iris-setosa','Iris-setosa','Iris-versicolor','Iris-virginica','Iris-setosa','Iris-versicolor'], 'a': [1,2,3,4,1,1]}) 
>>> y = df.Name 
>>> le = preprocessing.LabelEncoder() 
>>> le.fit(y) # fit your y array 
LabelEncoder() 
>>> le.classes_ # check your unique classes 
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object) 
>>> y_transformed = le.transform(y) # transform your y with numeric encodings 
>>> y_transformed 
array([0, 0, 1, 2, 0, 1], dtype=int64) 
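
Since the question also asks for a pandas way that keeps the mapping in a separate array, here is a minimal sketch using pd.factorize. The small `names` Series is an illustrative stand-in for the Name column of iris.csv, and the variable names are my own, not from the question.

```python
import pandas as pd

# Illustrative stand-in for the "Name" column of iris.csv.
names = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
     "Iris-virginica", "Iris-setosa", "Iris-versicolor"],
    name="Name",
)

# sort=True numbers the labels in sorted order, as LabelEncoder does;
# the default numbers them in order of first appearance instead.
codes, uniques = pd.factorize(names, sort=True)

# uniques is the separate mapping: position i holds the label for code i.
mapping = dict(enumerate(uniques))

# Recover the original strings from the integer codes.
restored = uniques.take(codes)

print(codes)
print(mapping)
```

With LabelEncoder the equivalent round trip is le.inverse_transform(y_transformed), and le.classes_ plays the role of `uniques`.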
-1

I suggest you import the iris dataset directly from scikit-learn, like this:

from sklearn import datasets 

iris = datasets.load_iris() 
X = iris.data 
y = iris.target 

Demo:

In [9]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.20

In [10]: from sklearn.linear_model import LinearRegression 

In [11]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 

In [12]: linreg = LinearRegression() 

In [13]: linreg.fit(X_train, y_train) 
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 

In [14]: linreg.score(X_test, y_test) 
Out[14]: 0.89946565707178838 

In [15]: y 
Out[15]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 
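
A side note, as a sketch rather than part of the answer above: the target here is a class label, and scikit-learn classifiers accept string labels for y directly, so a classifier may be a closer fit than LinearRegression. LogisticRegression and max_iter=1000 are my own choices for illustration. The string labels are rebuilt from iris.target_names, so they read 'setosa' rather than 'Iris-setosa' as in the CSV.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
# Rebuild string labels from the integer targets to mimic a string column.
y = iris.target_names[iris.target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Classifiers encode string labels internally; no manual mapping is needed.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)       # predictions come back as strings
accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out split

print(clf.classes_)
print(predicted[:5])
```

clf.classes_ holds the label mapping the classifier learned, which is the same information LabelEncoder exposes through le.classes_.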
+0

The anonymous downvoter strikes again... – Tonechas