2017-06-19 19 views
1

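offset() term in glm() ignored in SparkR 2.1.0?

I have been trying to fit a GLM (Poisson with log link, to be specific) on a dataset in SparkR. The dataset is very large, so collecting it and using R's own glm() is unlikely to work. The model includes an exposure term that has to enter as an offset (a regressor with a known coefficient; 1 in my case). Unfortunately, neither adding the offset term to the formula nor passing the column name (or the column itself, or a numeric vector formed by collecting the column after selecting it) works: in the first case the formula is not parsed, and in the other cases the offset term is ignored, with no error message at all. Here is an example of what I have been trying (output in comments):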

library(datasets) 
#set up Spark session 
#Sys.setenv(SPARK_HOME = "/usr/share/spark_2.1.0") 
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 
options(scipen = 15, digits = 5) 
sparkR.session(spark.executor.instances = "20", spark.executor.memory = "6g") 
# # Setting default log level to "WARN". 
# # To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 
# # 17/06/19 06:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
# # 17/06/19 06:33:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 
# # 17/06/19 06:34:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 
message(sparkR.conf()$spark.app.id) 
# # application_*************_**** 

#Test glm() in sparkR 
data("iris") 
iris_df = createDataFrame(iris) 
# # Warning messages: 
# # 1: In FUN(X[[i]], ...) : 
# # Use Sepal_Length instead of Sepal.Length as column name 
# # 2: In FUN(X[[i]], ...) : 
# # Use Sepal_Width instead of Sepal.Width as column name 
# # 3: In FUN(X[[i]], ...) : 
# # Use Petal_Length instead of Petal.Length as column name 
# # 4: In FUN(X[[i]], ...) : 
# # Use Petal_Width instead of Petal.Width as column name 
model = glm(Sepal_Length ~ offset(Sepal_Width) + Petal_Length, data = iris_df) 
# # 17/06/19 08:46:47 ERROR RBackendHandler: fit on org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed 
# # java.lang.reflect.InvocationTargetException 
# # ...... 
# # Caused by: java.lang.IllegalArgumentException: Could not parse formula: Sepal_Length ~ offset(Sepal_Width) + Petal_Length 
# # at org.apache.spark.ml.feature.RFormulaParser$.parse(RFormulaParser.scala:200) 
# # ...... 
model = glm(Sepal_Length ~ Petal_Length + offset(Sepal_Width), data = iris_df) 
# # (Same error as above) 
# The one below runs. 
model = glm(Sepal_Length ~ Petal_Length, offset = Sepal_Width, data = iris_df, family = gaussian()) 
# # 17/06/19 08:51:21 WARN WeightedLeastSquares: regParam is zero, which might cause numerical instability and overfitting. 
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK 
summary(model) 
# # Deviance Residuals: 
# # (Note: These are approximate quantiles with relative error <= 0.01) 
# # Min  1Q Median  3Q  Max 
# # -1.24675 -0.30140 -0.01999 0.26700 1.00269 
# # 
# # Coefficients: 
# # Estimate Std. Error t value Pr(>|t|) 
# # (Intercept) 4.3066 0.078389 54.939 0  
# # Petal_Length 0.40892 0.018891 21.646 0  
# # 
# # (Dispersion parameter for gaussian family taken to be 0.1657097) 
# # 
# # Null deviance: 102.168 on 149 degrees of freedom 
# # Residual deviance: 24.525 on 148 degrees of freedom 
# # AIC: 160 
# # 
# # Number of Fisher Scoring iterations: 1 
# (RESULTS ARE SAME AS GLM WITHOUT OFFSET) 

# Results in R: 
model = glm(Sepal.Length ~ Petal.Length, offset = Sepal.Width, data = iris, family = gaussian()) 
summary(model) 
# # Call: 
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(), 
# #  data = iris, offset = Sepal.Width) 
# # 
# # Deviance Residuals: 
# # Min  1Q Median  3Q  Max 
# # -0.93997 -0.27232 -0.02085 0.28576 0.88944 
# # 
# # Coefficients: 
# # Estimate Std. Error t value Pr(>|t|)  
# # (Intercept) 0.85173 0.07098 12.00 <2e-16 *** 
# # Petal.Length 0.51471 0.01711 30.09 <2e-16 *** 
# # --- 
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
# # 
# # (Dispersion parameter for gaussian family taken to be 0.1358764) 
# # 
# # Null deviance: 143.12 on 149 degrees of freedom 
# # Residual deviance: 20.11 on 148 degrees of freedom 
# # AIC: 130.27 
# # 
# # Number of Fisher Scoring iterations: 2 

#Results in R without offset. Matches SparkR output with and w/o offset. 
model = glm(Sepal.Length ~ Petal.Length, data = iris, family = gaussian()) 
summary(model) 
# # Call: 
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(), 
# #  data = iris) 
# # 
# # Deviance Residuals: 
# # Min  1Q Median  3Q  Max 
# # -1.24675 -0.29657 -0.01515 0.27676 1.00269 
# # 
# # Coefficients: 
# # Estimate Std. Error t value Pr(>|t|)  
# # (Intercept) 4.30660 0.07839 54.94 <2e-16 *** 
# # Petal.Length 0.40892 0.01889 21.65 <2e-16 *** 
# # --- 
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
# # 
# # (Dispersion parameter for gaussian family taken to be 0.1657097) 
# # 
# # Null deviance: 102.168 on 149 degrees of freedom 
# # Residual deviance: 24.525 on 148 degrees of freedom 
# # AIC: 160.04 
# # 
# # Number of Fisher Scoring iterations: 2 

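Note: the Spark version is 2.1.0 (as shown in the code). From what I have checked, the implementation should be there. Also, the warning messages after glm are not always shown, but that does not seem to affect what happens.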

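Am I doing something wrong, or is the offset term simply not used by Spark's glm implementation? If it is the latter, is there a workaround that gives the same results as a fit with an offset term?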

Answers

1

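A Poisson GLM with response Y and offset log(K) is equivalent to one with response Y/K and weights K, as the Insurance dataset from MASS illustrates.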
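A short sketch of why the two fits coincide, written in notation that is not in the original answer (x_i for the covariate vector, \beta for the coefficients, \mu_i and \lambda_i for the fitted means of the count and of the rate). Both models lead to the same score equations, so they share the same maximum-likelihood estimate of \beta:

```latex
\begin{aligned}
\text{Offset model: } & \log \mu_i = \log K_i + x_i^\top \beta, \qquad Y_i \sim \mathrm{Poisson}(\mu_i) \\
\text{Score: } & \sum_i (y_i - \mu_i)\, x_i
  = \sum_i \bigl(y_i - K_i e^{x_i^\top \beta}\bigr)\, x_i = 0 \\[4pt]
\text{Weighted rate model: } & \log \lambda_i = x_i^\top \beta, \quad
  \text{response } y_i / K_i, \quad \text{prior weight } K_i \\
\text{Score: } & \sum_i K_i \Bigl(\frac{y_i}{K_i} - \lambda_i\Bigr) x_i
  = \sum_i \bigl(y_i - K_i e^{x_i^\top \beta}\bigr)\, x_i = 0
\end{aligned}
```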

Example GLM fits:

> glm(Claims ~ District + Group + Age, data=Insurance, family=poisson, offset=log(Holders)) 

Call: glm(formula = Claims ~ District + Group + Age, family = poisson, 
    data = Insurance, offset = log(Holders)) 

Coefficients: 
(Intercept)    District2    District3    District4      Group.L      Group.Q      Group.C        Age.L        Age.Q        Age.C
  -1.810508     0.025868     0.038524     0.234205     0.429708     0.004632    -0.029294    -0.394432    -0.000355    -0.016737

Degrees of Freedom: 63 Total (i.e. Null); 54 Residual 
Null Deviance:  236.3 
Residual Deviance: 51.42 AIC: 388.7 



> glm(Claims/Holders ~ District + Group + Age, data=Insurance, family=quasipoisson, weights=Holders) 

Call: glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson, 
    data = Insurance, weights = Holders) 

Coefficients: 
(Intercept)    District2    District3    District4      Group.L      Group.Q      Group.C        Age.L        Age.Q        Age.C
  -1.810508     0.025868     0.038524     0.234205     0.429708     0.004632    -0.029294    -0.394432    -0.000355    -0.016737

Degrees of Freedom: 63 Total (i.e. Null); 54 Residual 
Null Deviance:  236.3 
Residual Deviance: 51.42 AIC: NA 

(The quasipoisson family is there to stop glm from complaining about the non-integer values it detects in the response.)

This technique should also be usable with Spark's GLM implementation.
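As a rough check that the workaround does not depend on quasipoisson, the sketch below fits both forms in plain R on simulated data. The data and the names `exposure`, `x`, and `claims` are invented for illustration. The commented SparkR call at the end is an untested assumption based on the `spark.glm` and `weightCol` suggestion from the comments, not something verified against a cluster.

```r
# Simulated exposure data (all names here are hypothetical).
set.seed(42)
n <- 200
exposure <- sample(10:500, n, replace = TRUE)
x <- rnorm(n)
claims <- rpois(n, lambda = exposure * exp(-2 + 0.3 * x))
d <- data.frame(claims, exposure, x)

# Reference fit: counts with a log-exposure offset.
fit_offset <- glm(claims ~ x, family = poisson,
                  offset = log(exposure), data = d)

# Workaround fit: rate as the response, exposure as prior weights.
# The plain poisson family also works here; suppressWarnings() only hides
# R's "non-integer x" warnings, which come from the AIC calculation.
fit_rate <- suppressWarnings(
  glm(claims / exposure ~ x, family = poisson,
      weights = exposure, data = d)
)

# The two coefficient vectors agree up to numerical tolerance.
print(all.equal(coef(fit_offset), coef(fit_rate), tolerance = 1e-6))

# Assumed SparkR analogue (untested sketch, requires a Spark session):
#   df <- createDataFrame(d)
#   df$rate <- df$claims / df$exposure
#   model <- spark.glm(df, rate ~ x, family = poisson(link = "log"),
#                      weightCol = "exposure")
```

Only the coefficients are compared; the AIC reported for the rate fit is not meaningful, because the Poisson likelihood is evaluated at non-integer responses.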

See also a similar question on stats.SE.

+0

This should work, but not if you use `glm`. I have upvoted, but would you mind adding a SparkR solution with `spark.glm` and `weightCol` instead of `weight`? – zero323

+0

@zero323 I have not used Spark's GLM implementation. Feel free to add the necessary information. –

+0

This does not seem to work in SparkR: ErrorError(returnStatus,conn): java.lang.IllegalArgumentException: glm_6fa65f4a1164 parameter family given invalid value quasipoisson. – user2715583