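column_offset in h2o.gbm

I am trying to fit a GBM Bernoulli model that starts from initial predictions produced by another model, and I end up with a worse likelihood than the predictions I started with. I was able to reproduce the problem with the Titanic data.

I can do what I want with R's gbm. R's gbm.fit expects the offset on the link scale, which is unbounded: the values can be very large and positive or very large and negative.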
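To make "link scale" concrete, here is a minimal base R sketch. The probabilities are made-up illustrative values, not output from the Titanic notebook, and the variable names are hypothetical.

```r
# Probabilities from a hypothetical earlier model (illustrative values only)
p_prev <- c(0.001, 0.20, 0.50, 0.90, 0.999)

# Link (logit) scale: log(p / (1 - p)); stats::qlogis computes exactly this
offset_link <- qlogis(p_prev)
print(offset_link)  # contains values well below -1 and well above 1

# The inverse link (stats::plogis) maps the offsets back to probabilities
p_back <- plogis(offset_link)
print(all.equal(p_back, p_prev))
```

Any probability above roughly 0.73 already has a logit greater than 1, so a cap of 1 on the offset rules out most confident predictions.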

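However, when I try to do the same thing with H2O GBM, it throws an error: Illegal argument(s) for GBM model: water.exceptions.H2OModelBuilderIllegalArgumentException GBM_model_R_1489164084643_3568. Details: ERRR on field: _offset_column: Offset cannot be larger than 1 for Bernoulli distribution.

My Jupyter notebook is here: Github

UPDATE: I was able to use the offset, but only with a data frame in which ProbabilityLink is less than 1, because H2O complains otherwise. See cells 65-68 in the notebook.

I believe this is a bug in H2O. They should remove the requirement that the offset for Bernoulli must be less than 1. It can be anything. Then it should work correctly.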

2017-03-16 Denisevi4

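Update

In older versions of H2O (3.10.2 or earlier), the offset_column for H2O GBM with the Bernoulli distribution has to contain values less than 1. In newer versions you can pass any value. In your case, with the Bernoulli distribution, one way to create the offset column is to use the logit of the predictions from your previous model (as you said in your comment).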
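As a rough sketch of that idea, the base R code below builds a logit-scale offset from a first-stage model. The data are simulated and the names `df`, `prev_model`, and `offset` are hypothetical stand-ins; a plain `glm` plays the role of the "previous model".

```r
set.seed(1234)

# Simulated stand-in data (not the Titanic set); x and y are hypothetical names
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x))
df <- data.frame(x = x, y = y)

# "Previous model": a plain logistic regression
prev_model <- glm(y ~ x, family = binomial(), data = df)

# Predictions on the probability scale and on the link (logit) scale
p_hat <- predict(prev_model, type = "response")
df$offset <- predict(prev_model, type = "link")

# Both routes agree: logit(p_hat) equals the link-scale prediction
print(all.equal(unname(qlogis(p_hat)), unname(df$offset)))
print(range(df$offset))

# Assumption: with a running H2O cluster, the frame could then be uploaded with
# as.h2o(df), the response converted with as.factor(), and the column passed as
#   h2o.gbm(..., offset_column = "offset", distribution = "bernoulli")
```

If only probabilities are available from the earlier model, `qlogis()` gives the same offset as `type = "link"`.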

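This is how the GBM offset column works: the offset is a per-row "bias value" that is used during model training. For Gaussian distributions, the offset can be seen as a simple correction to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset correction is applied in the linearized space before the inverse link function is applied to get the actual response values. This option is not available for multinomial distributions.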
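The two cases can be illustrated in base R. This is only a sketch: it uses `lm` as a simple stand-in for the GBM, and the data, `off`, `f_x`, and `offset_logit` are hypothetical values chosen for illustration.

```r
set.seed(42)

# Simulated data; off is a hypothetical per-row offset
n <- 100
x <- runif(n)
off <- 2 * x
y <- off + 3 + 0.5 * x + rnorm(n, sd = 0.1)

# Gaussian case: fitting with an offset is the same as fitting the
# offset-corrected response y - off
fit_offset <- lm(y ~ x, offset = off)
fit_corrected <- lm(I(y - off) ~ x)
print(all.equal(unname(coef(fit_offset)), unname(coef(fit_corrected))))

# Bernoulli case: the offset is added on the logit scale, and the inverse
# link is applied afterwards, so the result is always a valid probability
f_x <- -0.5          # hypothetical model contribution on the logit scale
offset_logit <- 2    # an offset larger than 1 is perfectly meaningful here
p <- plogis(offset_logit + f_x)
print(p)
```

Because the inverse link is applied after the offset is added, the size of the offset itself does not need to be bounded.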

Here is how to use this parameter on a toy dataset

(Bernoulli distribution example)

library(h2o) 
h2o.init() 

# import the cars dataset: 
# this dataset is used to classify whether or not a car is economical based on 
# the car's displacement, power, weight, and acceleration, and the year it was made 
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") 

# convert response column to a factor 
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"]) 

# create a new offset column (a constant 0.5 for every row, purely for illustration) 
cars["offset"] <- as.h2o(rep(.5, dim(cars)[1])) 

# set the predictor names and the response column name 
predictors <- c("displacement","power","weight","acceleration","year") 
response <- "economy_20mpg" 

# split into train and validation sets 
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234) 
train <- cars.split[[1]] 
valid <- cars.split[[2]] 

# try using the `offset_column` parameter: 
# train your model with training_frame and validation_frame 
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, offset_column = "offset", 
        validation_frame = valid, seed = 1234) 

# print the auc for your model 
print(h2o.auc(cars_gbm, valid = TRUE))

(Gaussian example, where using this option makes more sense)

library(h2o) 
h2o.init() 

# import the boston dataset: 
# this dataset looks at features of the boston suburbs and predicts   median housing prices 
# the original dataset can be found at  https://archive.ics.uci.edu/ml/datasets/Housing 
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") 

# set the predictor names and the response column name 
predictors <- colnames(boston)[1:13] 
# set the response column to "medv", the median value of owner-occupied  homes in $1000's 
response <- "medv" 

# convert the chas column to a factor (chas = Charles River dummy  variable (= 1 if tract bounds river; 0 otherwise)) 
boston["chas"] <- as.factor(boston["chas"]) 

# create a new offset column by taking the log of the response column 
boston["offset"] <- log(boston["medv"]) 

# split into train and validation sets 
boston.splits <- h2o.splitFrame(data = boston, ratios = .8, seed = 1234) 
train <- boston.splits[[1]] 
valid <- boston.splits[[2]] 

# try using the `offset_column` parameter: 
# train your model, where you specify the offset_column 
boston_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, 
       validation_frame = valid, 
       offset_column = "offset", 
       seed = 1234) 

# print the mse for validation set 
print(h2o.mse(boston_gbm, valid = TRUE))

2017-03-17 16:17:08 Lauren

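Hi Lauren, can you give an example for Bernoulli? That is where my confusion is. Your example is Gaussian. – Denisevi4

According to https://en.wikipedia.org/wiki/Generalized_linear_model, the link function I need to get to the linear space for Bernoulli is the logit, ln(mu / (1 - mu)), and that is what I used in my example... but H2O complains that the offset must be less than 1. What am I doing wrong? Or, more importantly, what does H2O want me to do? – Denisevi4

I added this Boston example to my [notebook on GitHub](https://github.com/Denisevi4/測試/blob/master/Test_Titanic.ipynb). I adapted it to what I actually want to do: not just some arbitrary offset, but the predictions from a glm. It works fine. So I have no problem with Gaussian. But Bernoulli is still a problem. – Denisevi4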
