Following up on my comments and @TenniStats' suggestion, the best approach is to reduce the size of the GLM. Consider the following:
#generating some sample data that's fairly large
sample.data <- data.frame('target' = sample(c(1:10), size = 5000000, replace = TRUE),
                          'regressor1' = rnorm(5000000),
                          'regressor2' = rnorm(5000000),
                          'regressor3' = rnorm(5000000),
                          'regressor4' = rnorm(5000000),
                          'regressor5' = rnorm(5000000),
                          'regressor6' = rnorm(5000000),
                          'regressor7' = rnorm(5000000),
                          'regressor8' = rnorm(5000000),
                          'regressor9' = rnorm(5000000),
                          'regressor10' = rnorm(5000000))
#building a toy glm - this one is about 3.3 GB
lm.mod <- glm(formula = target ~ ., data = sample.data, family = gaussian)
#baseline predictions
lm.default.preds <- predict(lm.mod, sample.data)
#extracting coefficients
lm.co <- coefficients(lm.mod)
#applying coefficients to original data set by row and adding intercept
lightweight.preds <- lm.co[1] +
  apply(sample.data[, 2:ncol(sample.data)],
        1,
        FUN = function(x) sum(x * lm.co[2:length(lm.co)]))
#clearing names from vector for comparison
names(lm.default.preds) <- NULL
#taa daa
all.equal(lm.default.preds, lightweight.preds)
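To see what this buys you, compare the in-memory sizes (object.size is base R; the exact figures will vary by machine, but the coefficient vector should be orders of magnitude smaller than the full model):

#comparing in-memory sizes - the coefficients are a tiny fraction of the model
format(object.size(lm.mod), units = 'auto')
format(object.size(lm.co), units = 'auto')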
Then we can do the following:
#saving for our example and starting timing
saveRDS(lm.co, file = 'myfile.RDS')
start.time <- Sys.time()
#reading from file
coefs.from.file <- readRDS('myfile.RDS')
#scoring function
light.scoring <- function(coeff, new.data) {
  prediction <- coeff[1] + sum(coeff[2:length(coeff)] * new.data)
  names(prediction) <- NULL
  return(prediction)
}
#same as before
light.scoring(coefs.from.file, sample.data[1, 2:11])
#~.03 seconds on my machine
Sys.time() - start.time
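If you need to score many rows at once rather than one at a time, a matrix multiplication avoids the row-wise apply() used above and is much faster. This is just a sketch of the same computation in vectorized form (batch.scoring is a name I made up; it assumes the columns of new.data are in the same order as the coefficients):

#vectorized scoring for many rows at once - same math as light.scoring()
batch.scoring <- function(coeff, new.data) {
  as.vector(coeff[1] + as.matrix(new.data) %*% coeff[-1])
}
#matches the baseline predictions from the full model
all.equal(batch.scoring(coefs.from.file, sample.data[, 2:11]), lm.default.preds)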
What is your question? – user3640617
A 2 GB GLM: wow. I would double-check that it really can't be made any smaller. – Zach
I think the problem is that the GLM model object keeps a copy of the training data it was fit on. Why not just export the coefficients and score manually (i.e. have the API generate the scores)? That should be simple enough with a plain linear model, since it's just y = Beta1*var1 + Beta2*var2 + ... etc. – BigTimeStats
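On Zach's point about making the model itself smaller: if you would rather keep using predict() than score by hand, most of a fitted glm's bulk is the stored copies of the data (the model frame, y, residuals, fitted values, the QR matrix) plus the environments captured by the formula and terms. Below is a sketch of stripping those out; strip.glm is my own name for it, and the exact list of components that are safe to drop is an assumption on my part. Notably, summary() and predict(..., se.fit = TRUE) will no longer work on the stripped object.

#dropping the heavy components of a fitted glm; predict() on new data still works
strip.glm <- function(model) {
  model$data <- NULL
  model$y <- NULL
  model$model <- NULL
  model$residuals <- NULL
  model$fitted.values <- NULL
  model$effects <- NULL
  model$linear.predictors <- NULL
  model$weights <- NULL
  model$prior.weights <- NULL
  model$qr$qr <- NULL   #keep qr$pivot and qr$rank, which predict() still uses
  #formulas and terms drag their enclosing environment along when serialized
  attr(model$terms, '.Environment') <- globalenv()
  attr(model$formula, '.Environment') <- globalenv()
  model
}
small.mod <- strip.glm(lm.mod)
all.equal(predict(small.mod, sample.data), predict(lm.mod, sample.data))

In this toy example the model was fit at the top level, so resetting the environments changes little; it matters mainly when the model is fit inside a function, where the formula would otherwise drag the whole enclosing environment into the saved file.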