2017-05-30 64 views

How to calculate GBM accuracy in R

I am using the gbm() function to build a model, and I want to measure its accuracy. Here is my code:

df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE) 

str(df) 

F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21) 
for(i in F) df[,i]=as.factor(df[,i]) 

library(caret) 

set.seed(1000) 
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE) 
train<-df[intrain, ] 
test<-df[-intrain, ] 

install.packages("gbm") 
library("gbm") 

df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4, 
       shrinkage=0.01, data=train) 
summary(df_boosting) 

yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100) 
mean((yhat.boost-test$Creditability)^2) 

However, when I call the summary function, an error occurs. The error message is as follows.

Error in plot.window(xlim, ylim, log = log, ...) : 
    need finite 'xlim' values 
In addition: Warning messages: 
1: In min(x) : no non-missing arguments to min; returning Inf 
2: In max(x) : no non-missing arguments to max; returning -Inf 

When I measure the MSE with the mean function, the following warning also appears:

Warning message: 
In Ops.factor(yhat.boost, test$Creditability) : 
    '-' not meaningful for factors 

Do you know why these two problems occur? Thanks in advance.

Answer

1

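In your code, the problem lies in how the (binary) response variable Creditability is defined. You declared it as a factor, but with distribution = "bernoulli", gbm expects a numeric response coded as 0 and 1. Column 1 (Creditability) therefore has to be left out of the list of columns that are converted to factors.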
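
As a minimal sketch of the conversion, for the case where the response has already been turned into a factor: going through as.character() first recovers the original 0/1 values, whereas calling as.numeric() directly returns the internal level codes 1 and 2. The vector `y_factor` below is made-up illustration data, not the credit dataset.

```r
# Hypothetical response that was already converted to a factor
y_factor <- factor(c(1, 0, 1, 1, 0))

# Pitfall: as.numeric() on a factor returns the level codes (1, 2), not 0/1
y_codes <- as.numeric(y_factor)

# Convert via character to recover the original 0/1 values
y_numeric <- as.numeric(as.character(y_factor))

print(y_codes)
print(y_numeric)
```

Applied to the question's data frame, the same idea would read df$Creditability <- as.numeric(as.character(df$Creditability)).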

Here is the corrected code:

df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE) 

F <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21) 
for(i in F) df[,i]=as.factor(df[,i]) 
str(df) 

Creditability is now a binary numeric variable:

'data.frame': 1000 obs. of 21 variables: 
$ Creditability     : int 1 1 1 1 1 1 1 1 1 1 ... 
$ Account.Balance     : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ... 
$ Duration.of.Credit..month.  : int 18 9 12 12 12 10 8 6 18 24 ... 
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ... 
$ Purpose       : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ... 
... 

...and the rest of the code works fine:

library(caret) 
set.seed(1000) 
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE) 
train <- df[intrain, ] 
test <- df[-intrain, ] 

library("gbm") 
df_boosting <- gbm(Creditability~., distribution = "bernoulli", 
     n.trees=100, verbose=TRUE, interaction.depth=4, 
     shrinkage=0.01, data=train) 
par(mar=c(3,14,1,1)) 
summary(df_boosting, las=2) 


########## 
                   var rel.inf 
Account.Balance          Account.Balance 36.8578980 
Credit.Amount           Credit.Amount 12.0691120 
Duration.of.Credit..month.    Duration.of.Credit..month. 10.5359895 
Purpose              Purpose 10.2691646 
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit 9.1296524 
Value.Savings.Stocks       Value.Savings.Stocks 4.9620662 
Instalment.per.cent        Instalment.per.cent 3.3124252 
... 
########## 

yhat.boost <- predict(df_boosting , newdata=test, n.trees=100) 
mean((yhat.boost-test$Creditability)^2) 

[1] 0.2719788 
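
Note that predict() on a gbm model returns values on the link (log-odds) scale by default, so the number above is not a mean squared error on probabilities. As a rough sketch of how a percentage accuracy could be obtained instead, the helper below thresholds predicted probabilities and compares them with the observed 0/1 labels. The name `accuracy_from_prob` and the 0.5 cutoff are assumptions for illustration, and the toy vectors are made up. With the model above, the probabilities could be requested directly with predict(df_boosting, newdata=test, n.trees=100, type="response").

```r
# Share of correct 0/1 classifications, given predicted probabilities
# (hypothetical helper; the 0.5 cutoff is an assumption, not a gbm default)
accuracy_from_prob <- function(prob, observed, cutoff = 0.5) {
  predicted <- ifelse(prob > cutoff, 1, 0)
  mean(predicted == observed)
}

# Made-up log-odds, as returned by predict() with the default type = "link"
log_odds <- c(2.0, -1.5, 0.3, -0.2, 1.1)
observed <- c(1, 0, 0, 0, 1)

# plogis() maps log-odds to probabilities (the inverse logit)
prob <- plogis(log_odds)

acc <- accuracy_from_prob(prob, observed)
cat(sprintf("Accuracy: %.1f%%\n", 100 * acc))
```

The cutoff is worth choosing deliberately: when the two classes are unbalanced, a threshold other than 0.5 may be more sensible.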

Hope this helps.

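Why should I change the type of the Creditability variable? It is a factor variable made up of 0s and 1s. Is there a way to get the accuracy as a percentage rather than as an MSE? Or is MSE the only way to measure accuracy?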

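@신익수 I changed 'Creditability' from factor to numeric because 'gbm' requires it. I did not look into methods for measuring the predictive performance of 'gbm'. In any case, MSE is not an appropriate measure here. I suggest using, for example, a method based on the ROC curve.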
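
One way to sketch the ROC-based suggestion without extra packages is to compute the area under the ROC curve (AUC) from ranks, using the Mann-Whitney formulation. The helper name `auc_rank` and the toy data below are assumptions for illustration, not results from the credit data. Because the AUC depends only on the ordering of the scores, predictions on either the link or the response scale give the same value.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation
# (hypothetical helper; assumes `observed` is coded 0/1 with both classes present)
auc_rank <- function(score, observed) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and labels, not results from the credit data
score    <- c(0.9, 0.8, 0.4, 0.35, 0.1)
observed <- c(1,   0,   1,   0,    0)

auc <- auc_rank(score, observed)
print(auc)
```

With the model from the answer, `score` would be the output of predict() on the test set and `observed` would be test$Creditability.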

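@Macro Sandri So, to run gbm in R, do I have to change the target (dependent) variable to numeric rather than categorical? But the data is for classification, not regression.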