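XGBoost - Poisson distribution with varying exposure / offset
I am trying to use XGBoost to model the claims frequency of data generated from exposure periods of unequal length, but I have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - is it possible to do this in XGBoost?
(A similar question was posted here: xgboost, offset exposure?)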
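To illustrate the issue, the R code below generates some data with the following fields:
- x1, x2 - factors (either 0 or 1)
- exposure - length of the policy period over which the data were observed
- frequency - mean number of claims per unit of exposure
- claims - number of observed claims ~ Poisson(frequency * exposure)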
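The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, and frequency = 1 otherwise.
Exposure cannot be used as a predictor of frequency, because it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.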
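To spell out why a log offset is the usual way to encode this relationship, the identity can be written out under the assumption of a log link. The symbols N (claim count), λ (frequency) and e (exposure) are notation introduced here, not part of the original post:

```latex
\begin{aligned}
\mathbb{E}[N \mid x_1, x_2, e] &= \lambda(x_1, x_2)\, e \\
\log \mathbb{E}[N \mid x_1, x_2, e] &= \log \lambda(x_1, x_2) + \log e
\end{aligned}
```

The term log e enters the linear predictor with its coefficient fixed at 1, so the model only has to learn log λ from x1 and x2.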
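The code tries to predict this using XGBoost by:
- setting exposure as a weight in the model matrix
- setting log(exposure) as an offset
Below these, I have shown how I would handle the situation for a tree (rpart) or gbm.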
set.seed(1)
size<-10000
d <- data.frame(
x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)
#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
"objective" = "count:poisson"
, "eval_metric" = "poisson-nloglik"
, "eta" = 1
, "subsample" = 1
, "colsample_bytree" = 1
, "min_child_weight" = 1
, "max_depth" = 2
)
## 1 - set weight in xgb.Matrix
xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)
## 2 - set as offset in xgb.Matrix
xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain <- xgb.DMatrix(xgtrain.m,label = d$claims)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_2 <- predict(xgb, xgtrain)
#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
data = d,
method = "poisson")
d$Tree_F <- predict(tree, newdata = d)
#### Fit a GBM
require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)),
data = d,
distribution = "poisson",
n.trees = 1,
shrinkage=1,
interaction.depth=2,
bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")
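As a reference point for what a correct treatment of exposure should recover, here is a minimal sketch using a plain Poisson GLM from base R with log(exposure) as an offset. It is a baseline rather than an XGBoost solution. The names `fit`, `grid` and `GLM_F` are introduced here for illustration only, and the data are regenerated so that the block runs on its own.

```r
# Baseline sketch (assumption: base R glm, not XGBoost).
# Regenerate the same kind of data as above so the block is self-contained.
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

# The offset enters the linear predictor with a fixed coefficient of 1,
# so the fitted coefficients describe claims per unit of exposure.
fit <- glm(claims ~ x1 * x2 + offset(log(exposure)),
           family = poisson(link = "log"), data = d)

# Predict frequency by evaluating the model at exposure = 1 (log(1) = 0).
grid <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
grid$exposure <- 1
grid$GLM_F <- predict(fit, newdata = grid, type = "response")
print(grid)
```

If the offset is handled correctly, GLM_F should come out close to 1 for three of the cells and close to 2 where x1 = x2 = 1, which is the target for the XGBoost fits above.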
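Thanks Rong. That was one of the options I tried, but it did not seem to work as expected in the simple case. I believe I have now found a solution and have posted it here.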