
For the code below, my R-squared score comes out negative, but my accuracy score using k-fold cross-validation comes out to about 92%. How is this possible? I am using the Random Forest regression algorithm to predict some data. The dataset is available at the link below: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder,OneHotEncoder 

dataset = pd.read_csv("HR_comma_sep.csv") 
x = dataset.iloc[:,:-1].values ##Independent variable 
y = dataset.iloc[:,9].values  ##Dependent variable 

##Encoding the categorical variables 

le_x1 = LabelEncoder() 
x[:,7] = le_x1.fit_transform(x[:,7]) 
le_x2 = LabelEncoder() 
x[:,8] = le_x2.fit_transform(x[:,8]) 
ohe = OneHotEncoder(categorical_features = [7,8]) 
x = ohe.fit_transform(x).toarray() 


##splitting the dataset in training and testing data 

from sklearn.cross_validation import train_test_split 
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1) 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0) 

from sklearn.preprocessing import StandardScaler 
sc_x = StandardScaler() 
x_train = sc_x.fit_transform(x_train) 
x_test = sc_x.transform(x_test) 
sc_y = StandardScaler() 
y_train = sc_y.fit_transform(y_train) 

from sklearn.ensemble import RandomForestRegressor 
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0) 
regressor.fit(x_train, y_train) 

y_pred = regressor.predict(x_test) 
print(y_pred) 
from sklearn.metrics import r2_score 
r2_score(y_test , y_pred) 

from sklearn.model_selection import cross_val_score 
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10) 
accuracies.mean() 
accuracies.std() 

Answer


There are several issues with your question...

For starters, you are making a very basic mistake: you think you are using the accuracy metric, while you are in a regression setting, where the metric actually used under the hood is the mean squared error (MSE).

Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
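
To make the distinction concrete, here is a minimal sketch (with made-up toy labels, not taken from the HR dataset) contrasting the two metrics:

from sklearn.metrics import accuracy_score, mean_squared_error 

y_true = [0, 1, 1, 0, 1]                 # toy binary labels (hypothetical) 
y_class_pred = [0, 1, 0, 0, 1]           # hard class predictions - accuracy applies 
y_reg_pred = [0.1, 0.9, 0.4, 0.2, 0.8]   # continuous predictions - MSE applies 

accuracy_score(y_true, y_class_pred)     # 0.8, i.e. 4 out of 5 correctly classified 
mean_squared_error(y_true, y_reg_pred)   # 0.092, the average squared error 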

The metric used internally by your chosen regressor (Random Forest) is shown in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, 
      max_features='auto', max_leaf_nodes=None, 
      min_impurity_split=1e-07, min_samples_leaf=1, 
      min_samples_split=2, min_weight_fraction_leaf=0.0, 
      n_estimators=10, n_jobs=1, oob_score=False, random_state=0, 
      verbose=0, warm_start=False) 

MSE is a positive continuous quantity, and it is not upper-bounded by 1; that is, if you got a value of 0.92, it means... well, 0.92, not 92%.
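
As a toy illustration (numbers unrelated to the dataset above), MSE easily exceeds 1 whenever the targets themselves are large:

from sklearn.metrics import mean_squared_error 

mean_squared_error([100, 200, 300], [110, 190, 330])  # ~366.67 - a perfectly valid MSE, far above 1 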

Knowing this, it is good practice to include MSE explicitly as the scoring function of your cross-validation:

cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error') 
cv_mse.mean() 
# -2.433430574463703e-28 
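
Note that scikit-learn negates the MSE here so that higher always means better for its scorers; flipping the sign recovers the usual (positive) MSE:

-cv_mse.mean()  # ~2.43e-28, the (tiny) mean MSE across the 10 folds 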

For all practical purposes, this is zero - you fit your training set almost perfectly; for confirmation, here is your R-squared score on the training set (again, practically perfect):

train_pred = regressor.predict(x_train) 
r2_score(y_train , train_pred) 
# 1.0 

But, as always, the moment of truth comes when you apply your model to the test set; your second mistake here is that, since you trained your regressor with a scaled y_train, you should also scale y_test before evaluating:

y_test = sc_y.fit_transform(y_test) 
r2_score(y_test , y_pred) 
# 0.9998476914664215 

and you get a very nice R-squared on the test set (close to 1).
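
As a side note, the snippet above refits the scaler on y_test; an alternative, sketched here under the assumption that y_test still holds the raw, unscaled targets and that sc_y is still the scaler fitted on y_train, is to reuse the training-set scaling:

y_test_scaled = sc_y.transform(y_test)   # reuse the mean/std learned from y_train 
r2_score(y_test_scaled, y_pred)          # expected to be close to 1 as well 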

What about the MSE?

from sklearn.metrics import mean_squared_error 
mse_test = mean_squared_error(y_test, y_pred) 
mse_test 
# 0.00015230853357849051 
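
If an error figure in the original 0/1 units of the target is preferred, the scaled predictions can be mapped back with the scaler's inverse transform - a sketch assuming sc_y is still the scaler fitted on y_train and y_test_raw is a copy of the unscaled test targets kept aside before any scaling (neither appears verbatim in the code above):

y_pred_orig = sc_y.inverse_transform(y_pred.reshape(-1, 1))  # back to the original target scale 
mean_squared_error(y_test_raw, y_pred_orig)                  # MSE in the original units of 'left' 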

Thank you!!!!!!! – Anant Vikram Singh


@AnantVikramSingh You are welcome – desertnaut
