h2o.glm does not match glm in R for linear regression

Posted 2020-07-27 16:18

Question:

I have been working with H2O.ai (version 3.10.3.6) in combination with R.

I am struggling to replicate the results of glm with h2o.glm. I would expect exactly the same result (evaluated, in this case, in terms of mean squared error), but I am seeing much worse accuracy with h2o. Since my model is Gaussian, I would expect both cases to be ordinary least squares (or, equivalently, maximum likelihood) regressions.

Here is my example:

train <- model.matrix(~., training_df)
test <- model.matrix(~., testing_df)

model1 <- glm(response ~., data=data.frame(train))
yhat1 <- predict(model1 , newdata=data.frame(test))
mse1 <- mean((testing_df$response - yhat1)^2) #5299.128

library(h2o)
h2o.init()

h2o_training <- as.h2o(train)[-1,]  # drop the extra all-NA row (see note below)
h2o_testing <- as.h2o(test)[-1,]

model2 <- h2o.glm(x = 2:dim(h2o_training)[2], y = 1,
                  training_frame = h2o_training,
                  family = "gaussian", alpha = 0)

yhat2 <- h2o.predict(model2, h2o_testing)
yhat2 <- as.numeric(as.data.frame(yhat2)[,1])
mse2 <- mean((testing_df$response - yhat2)^2) #8791.334

The MSE is about 66% higher for the h2o model. Is my hypothesis that glm ≈ h2o.glm wrong? I will try to provide an example dataset as soon as possible (the real training dataset is confidential: 350,000 rows x 350 columns).

An extra question: for some reason, as.h2o adds an extra row consisting entirely of NAs, so h2o_training and h2o_testing each gain an additional row. Removing it before building the model (the [-1,] above) does not affect the regression performance. No NA values are passed to either glm or h2o.glm; that is, the training matrices contain no NAs.
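For reference, a quick way to verify that no NAs survive on either side (a minimal sketch using the frame names above; h2o.nacnt returns per-column NA counts):

stopifnot(!anyNA(train), !anyNA(test))  # R matrices: no missing values
h2o.nacnt(h2o_training)                 # should be all zeros after dropping the row
h2o.nacnt(h2o_testing)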

Answer 1:

There are a few arguments you need to set to get H2O's GLM to match R's GLM, since by default they do not behave the same way. Here is an example that produces identical results:

library(h2o)
h2o.init(nthreads = -1)

path <- system.file("extdata", "prostate.csv", package = "h2o")
train <- h2o.importFile(path)

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
x <- setdiff(colnames(train), c("ID", "DPROS", "DCAPS", "VOL"))

# Train H2O GLM (designed to match R)
h2o_glmfit <- h2o.glm(y = "VOL", 
                      x = x, 
                      training_frame = train, 
                      family = "gaussian",
                      lambda = 0,
                      remove_collinear_columns = TRUE,
                      compute_p_values = TRUE,
                      solver = "IRLSM")

# Train an R GLM
r_glmfit <- glm(VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON, 
                data = as.data.frame(train)) 

Here are the coefs (they match):

> h2o.coef(h2o_glmfit)
  Intercept     CAPSULE         AGE 
-4.35605671 -4.29056573  0.29789896 
       RACE         PSA     GLEASON 
 4.35567076  0.04945783 -0.51260829 

> coef(r_glmfit)
(Intercept)     CAPSULE         AGE 
-4.35605671 -4.29056573  0.29789896 
       RACE         PSA     GLEASON 
 4.35567076  0.04945783 -0.51260829 

I've added a JIRA ticket to add this info to the docs.
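Applied to the setup in the question (frame names taken from there), the call would look something like this — a sketch, not a tested run against that data:

model2 <- h2o.glm(x = 2:ncol(h2o_training), y = 1,
                  training_frame = h2o_training,
                  family = "gaussian",
                  lambda = 0,                       # turn off regularization entirely
                  remove_collinear_columns = TRUE,  # mimic R dropping aliased columns
                  compute_p_values = TRUE,
                  solver = "IRLSM")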



Answer 2:

Is my hypothesis that glm ≈ h2o.glm wrong?

The algorithm of h2o.glm is different from R's glm.

h2o.glm is actually far more similar to the glmnet R package because they both support Elastic Net regularization (and two of the authors of glmnet, Hastie and Tibshirani, are advisors to H2O.ai).

When building H2O's glm, we used glmnet as a measuring stick far more so than R's glm.

Having said all that, you shouldn't expect exactly the same coefficients, but I would also not expect such a dramatically worse MSE.
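To illustrate the glmnet kinship, here is a minimal sketch on simulated data (not from the answer; alpha and lambda play the same roles in glmnet as in h2o.glm):

library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- as.vector(x %*% rnorm(5) + rnorm(100))

# Ridge penalty (alpha = 0) with a nonzero lambda shrinks the coefficients,
# which is why a default-regularized h2o.glm drifts away from glm()'s OLS fit
coef(glmnet(x, y, alpha = 0, lambda = 1))

# With lambda = 0 the penalty vanishes and the coefficients approach OLS
coef(glmnet(x, y, alpha = 0, lambda = 0))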



Answer 3:

I want to expand on the first answer and suggest:

solver = "IRLSM"
lambda = 0
remove_collinear_columns = TRUE
compute_p_values = TRUE
objective_epsilon = 1e-8
max_iterations = 25

glm() uses glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE) as its default convergence settings (for every family, not just logistic regression), which is what objective_epsilon and max_iterations mirror above.
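A sketch tying the two sides together (data frame, H2O frame, and column names here are placeholders, not from the thread):

# R side: these control values are already glm()'s defaults, written out explicitly
r_fit <- glm(response ~ ., data = df,
             control = glm.control(epsilon = 1e-8, maxit = 25))

# H2O side: mirror the same convergence tolerances
h2o_fit <- h2o.glm(y = "response", x = predictors,
                   training_frame = hf,
                   family = "gaussian",
                   lambda = 0,
                   remove_collinear_columns = TRUE,
                   compute_p_values = TRUE,
                   solver = "IRLSM",
                   objective_epsilon = 1e-8,
                   max_iterations = 25)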



Tags: r h2o