Print “pretty” tables for h2o models in R

2019-01-07 19:30发布

问题:

There are multiple packages for R which help to print "pretty" tables (LaTeX/HTML/TEXT) from statistical models output AND to easily compare the results of alternative model specifications.

Some of these packages are apsrtable, xtable, memisc, texreg, outreg, and stargazer (for examples see here: https://www.r-statistics.com/2013/01/stargazer-package-for-beautiful-latex-tables-from-r-statistical-models-output/).

Is there any comparable R package that does support the models of the h2o package?

Here is an example of two simple GLM models with h2o which I like to print beside each other as "beautiful" tables.

# Load package and setup h2o
library(h2o)
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '4g')

# Load data
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")

# Run GLMs
model.output.1 <- h2o.glm(y = "CAPSULE", x = c("RACE","PSA","DCAPS"),
  training_frame = prostate.hex,family = "binomial", nfolds = 0, 
  alpha = 0.5, lambda_search = FALSE)
model.output.2 <- h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), 
  training_frame = prostate.hex, family = "binomial", nfolds = 0, 
  alpha = 0.5, lambda_search = FALSE)

This is how it would look like with an regular GLM object using screenreg() from the texreg package:

library(data.table)
library(texreg)
d <- fread(prostatePath)
model.output.1.glm <- glm(CAPSULE ~ RACE + PSA + DCAPS, data=d)
model.output.2.glm <- glm(CAPSULE ~ AGE + RACE + PSA + DCAPS, data=d)
screenreg(list(model.output.1.glm, model.output.2.glm))

回答1:

texreg package author here. texreg is based on generic functions, which means any user can add custom extract methods for arbitrary models and make them work with texreg. I will walk you through this below. I am hoping that this detailed exposition will help other people who have asked similar questions in the past to devise their own texreg extensions.

See also Section 6 in the 2013 paper in the Journal of Statistical Software for another example. First, however, I will describe how the texreg architecture works more generally to give you an idea what is possible.

texreg and the generic extract function

There are three functions: texreg (for LaTeX output), htmlreg (for HTML output, which can also be interpreted by Word or Markdown in most usage scenarios), and screenreg (for ASCII text output in the console). These three functions serve to convert some cleaned-up information (coefficients, standard errors, confidence intervals, p-values, goodness-of-fit statistics, model labels etc.) into the respective output formats. This is not perfect and could be even more flexible, but I think it has quite a few arguments for customization at this point, including things like booktabs and dcolumn support. So the big challenge is rather to get cleaned-up model information in the first place.

This is done by providing a texreg object to any of these three functions. A texreg object is just a container for coefficients etc. and is formally defined using an S4 class. To create a texreg object, you can either use the constructor function createTexreg (as documented on the help pages), which accepts all the different pieces of information as arguments, such as the standard errors etc. Or (better) you can use the extract function to extract those pieces of information from some estimated model and return a texreg object for use with any of the three functions. The way you typically do this is by just handing over a list of several models to the texreg or screenreg etc. function, and this function will internally call extract to create texreg objects and then process the information from those objects.

However, it is equally valid to just save the output of a call of the extract function to an object, possibly manipulate this texreg object, and then call the texreg function on the manipulated object to display it as a table. This permits some flexibility in tweaking the results.

Earlier on, I mentioned that the package uses generic functions. This means that the extract function is generic in the sense that you can register methods for it that work with arbitrary classes of models. For example, if the extract function does not know how to handle h2o objects and how to extract the relevant information from such an object, you can just write a method to do that and register it with the extract function. Below, I will walk you through this step by step in the hope that people will learn from this detailed exposition and start writing their own extensions. (Note: if somebody develops a useful method, please e-mail me and I can include it in the next texreg release.) It is also worth pointing out that the source files of the package contain more than 70 examples of extract methods which you can use as templates. These examples are stored in the file R/extract.R.

Identifying the class label and setting up an extract method

The first step is to identify the class name of the object. In your example, class(model.output.1) returns the following class labels: "H2OBinomialModel" and "h2o". The first label is the more specific one, and the second label is the more general one. If all h2o model objects are structured in a similar way, it will make sense to write an extension for h2o objects and then decide within the method how to proceed with the specific model. As I know virtually nothing about the h2o package, I prefer to start out with a more specific extract method for H2OBinomialModel objects in this case. It can be adjusted later on should we wish to do so.

extract methods are structured as follows: you write a function called extract.xyz (replace "xyz" by the class label), have at least one argument called model, which accepts the model object (e.g., model.output.1 in your example), put some code in the body that extracts the relevant information from the model object, create a texreg object using the createTexreg constructor, and return this object. Here is an empty container:

extract.H2OBinomialModel <- function(model, ...) {
  s <- summary(model)

  # extract information from model and summary object here

  # then create and return a texreg object (replace NULL with actual values):
  tr <- createTexreg(
    coef.names = NULL,    # character vector of coefficient labels
    coef = NULL,          # numeric vector with coefficients
    se = NULL,            # numeric vector with standard error values
    pvalues = NULL,       # numeric vector with p-values
    gof.names = NULL,     # character vector with goodness-of-fit labels
    gof = NULL,           # numeric vector of goodness-of-fit statistics
    gof.decimal = NULL    # logical vector: GOF statistic has decimal points?
  )
  return(tr)
}

Note that the function definition also contains the ... argument, which can be used for custom arguments that should be handed over to function calls within the extract method.

Note also that the first line in the body of the function definition saves the summary of the model in an object called s. This is often useful because many package writers decide to store some of the information in a simpler version in the summary, so one should usually consider both, the model and its summary, as useful sources of information. In some cases, one may have to look at the actual definition of the summary method in the respective package to find out how the pieces of information displayed on the summary page are computed when the summary command is called because not all summary methods store the different elements displayed in the summary object.

Locating the right information in an H2OBinomialModel object

The next step is to examine the object and locate all the details that should be displayed in the final table. By looking at the output of model.output.1, I would guess that the following part should constitute the GOF block at the bottom of the table:

MSE:  0.202947
R^2:  0.1562137
LogLoss:  0.5920097
Mean Per-Class Error:  0.3612191
AUC:  0.7185655
Gini:  0.4371311
Null Deviance:  512.2888
Residual Deviance:  449.9274
AIC:  457.9274

And the following part should probably constitute the coefficient block in the middle of the table:

Coefficients: glm coefficients
      names coefficients standardized_coefficients
1 Intercept    -1.835223                 -0.336428
2      RACE    -0.625222                 -0.193052
3     DCAPS     1.314428                  0.408336
4       PSA     0.046861                  0.937107

In many cases, the summary contains the relevant information, but here printing the model yields what we need. We will need to locate all of this in the model.output.1 object (or its summary if applicable). To do that, there are several useful commands. Among them are str(model.output.1), names(summary(model.output.1)), and similar commands.

Let's start with the coefficient block. Calling str(model) reveals that there is a slot called model in the S4 object. We can look at its contents by calling model.output.1@model. The result is a list with several named elements, among them coefficients_table. So we can access the coefficient table by calling model.output.1@model$coefficients_table. The result is a data frame the columns of which we can access using the $ operator. In particular, we need the names and the coefficients. There are two types of coefficients here, standardized and unstandardized, and we can add an argument to our extract method later to let the user decide what he or she wants. Here is how we extract the coefficients and their labels:

coefnames <- model.output.1@model$coefficients_table$names
coefs <- model.output.1@model$coefficients_table$coefficients
coefs.std <- model.output.1@model$coefficients_table$standardized_coefficients

As far as I can see, there are no standard errors or p-values stored in the object. We could write some additional code to compute them should we wish to do that, but here we will focus on things that are readily provided as part of the model output.

It is important that we should not overwrite any existing function names in R, such as names or coef. While doing this should technically work because the code is executed within a function later, this can easily lead to confusion while trying things out, so you should better avoid that.

Next, we need to locate the goodness-of-fit statistics. By examining the output of str(model.output.1) carefully, we see that the goodness-of-fit statistics are contained in several slots under model@model$training_metrics@metrics. Let's save them to some objects that are easier to access:

mse <- model.output.1@model$training_metrics@metrics$MSE
r2 <- model.output.1@model$training_metrics@metrics$r2
logloss <- model.output.1@model$training_metrics@metrics$logloss
mpce <- model.output.1@model$training_metrics@metrics$mean_per_class_error
auc <- model.output.1@model$training_metrics@metrics$AUC
gini <- model.output.1@model$training_metrics@metrics$Gini
nulldev <- model.output.1@model$training_metrics@metrics$null_deviance
resdev <- model.output.1@model$training_metrics@metrics$residual_deviance
aic <- model.output.1@model$training_metrics@metrics$AIC

In some cases, but not here, the author of a package writes methods for generic functions that can be used to extract some common information like the number of observations (nobs(model)), AIC (AIC(model)), BIC (BIC(model)), deviance (deviance(model)), or the log likelihood (logLik(model)[[1]]). So these are things you may want to try first; but the h2o package does not seem to offer such convenience methods.

Adding the information to the extract.H2OBinomialModel function

Now that we have located all pieces of information we need, we can add them to the extract.H2OBinomialModel function we defined above.

However, we would like to let the user decide if he or she prefers raw or standardized coefficients, and we would like to let the user decide which goodness-of-fit statistics should be reported, so we add various logical arguments to the function header and then use if-conditions inside the function to check whether we should embed the respective statistics in the resulting texreg object.

We also remove the line s <- summary(model) in this case because we don't actually need any kind of information from the summary since we found everything we need in the model object.

The complete function may look like this:

# extension for H2OBinomialModel objects (h2o package)
extract.H2OBinomialModel <- function(model, standardized = FALSE, 
      include.mse = TRUE, include.rsquared = TRUE, include.logloss = TRUE, 
      include.meanerror = TRUE, include.auc = TRUE, include.gini = TRUE, 
      include.deviance = TRUE, include.aic = TRUE, ...) {

  # extract coefficient table from model:
  coefnames <- model@model$coefficients_table$names
  if (standardized == TRUE) {
    coefs <- model@model$coefficients_table$standardized_coefficients
  } else {
    coefs <- model@model$coefficients_table$coefficients
  }

  # create empty GOF vectors and subsequently add GOF statistics from model:
  gof <- numeric()
  gof.names <- character()
  gof.decimal <- logical()
  if (include.mse == TRUE) {
    mse <- model@model$training_metrics@metrics$MSE
    gof <- c(gof, mse)
    gof.names <- c(gof.names, "MSE")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.rsquared == TRUE) {
    r2 <- model@model$training_metrics@metrics$r2
    gof <- c(gof, r2)
    gof.names <- c(gof.names, "R^2")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.logloss == TRUE) {
    logloss <- model@model$training_metrics@metrics$logloss
    gof <- c(gof, logloss)
    gof.names <- c(gof.names, "LogLoss")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.meanerror == TRUE) {
    mpce <- model@model$training_metrics@metrics$mean_per_class_error
    gof <- c(gof, mpce)
    gof.names <- c(gof.names, "Mean Per-Class Error")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.auc == TRUE) {
    auc <- model@model$training_metrics@metrics$AUC
    gof <- c(gof, auc)
    gof.names <- c(gof.names, "AUC")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.gini == TRUE) {
    gini <- model@model$training_metrics@metrics$Gini
    gof <- c(gof, gini)
    gof.names <- c(gof.names, "Gini")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.deviance == TRUE) {
    nulldev <- model@model$training_metrics@metrics$null_deviance
    resdev <- model@model$training_metrics@metrics$residual_deviance
    gof <- c(gof, nulldev, resdev)
    gof.names <- c(gof.names, "Null Deviance", "Residual Deviance")
    gof.decimal <- c(gof.decimal, TRUE, TRUE)
  }
  if (include.aic == TRUE) {
    aic <- model@model$training_metrics@metrics$AIC
    gof <- c(gof, aic)
    gof.names <- c(gof.names, "AIC")
    gof.decimal <- c(gof.decimal, TRUE)
  }

  # create texreg object:
  tr <- createTexreg(
    coef.names = coefnames, 
    coef = coefs, 
    gof.names = gof.names, 
    gof = gof, 
    gof.decimal = gof.decimal
  )
  return(tr)
}

For the goodness-of-fit block, you can see that I first created empty vectors and then subsequently populated them with additional statistics provided that the respective statistic was switched on by the user using the respective argument.

The gof.decimal logical vector indicates for each GOF statistic whether it has decimal places (TRUE) or not (FALSE, as in the number of observations, for example).

Finally, the new function needs to be registered as a method for the generic extract function. This is done using a simple command:

setMethod("extract", signature = className("H2OBinomialModel", "h2o"), 
definition = extract.H2OBinomialModel)

Here, the first argument of className is the class label, and the second one is the package in which the class is defined.

Summing up, the only two things that need to be done in order to write a custom extension are 1) writing an extract method, and 2) registering the method. That is, this code can be executed at runtime and doesn't have to be inserted into any package.

However, for your convenience, I have added the H2OBinomialModel method to texreg version 1.36.13, which is available on CRAN.

Note that the solution presented here does not work with any previous version of texreg because previous versions could not deal with models that had neither standard errors nor confidence intervals. This is a fairly specialized setup in my opinion, and I hadn't come across a package that merely provides the estimates without any uncertainty measures. I have now fixed this in texreg.

Trying out the new extract method

Once the function definition and the setMethod command have been executed at runtime, the following command can be used to create a table:

screenreg(list(model.output.1, model.output.2), custom.note = "")

This is the output:

======================================
                      Model 1  Model 2
--------------------------------------
Intercept              -1.84    -1.11 
RACE                   -0.63    -0.62 
DCAPS                   1.31     1.31 
PSA                     0.05     0.05 
AGE                             -0.01 
--------------------------------------
MSE                     0.20     0.20 
R^2                     0.16     0.16 
LogLoss                 0.59     0.59 
Mean Per-Class Error    0.36     0.38 
AUC                     0.72     0.72 
Gini                    0.44     0.44 
Null Deviance         512.29   512.29 
Residual Deviance     449.93   449.51 
AIC                   457.93   459.51 
======================================

The custom.note = "" argument makes sense here because we don't want a significance legend since the models do not report any uncertainty measures.

It is also possible to suppress some of the GOF measures and/or use standardized coefficients:

screenreg(list(model.output.1, model.output.2), custom.note = "", 
    include.deviance = FALSE, include.auc = FALSE, standardized = TRUE)

The result:

======================================
                      Model 1  Model 2
--------------------------------------
Intercept              -0.34    -0.34 
RACE                   -0.19    -0.19 
DCAPS                   0.41     0.41 
PSA                     0.94     0.94 
AGE                             -0.07 
--------------------------------------
MSE                     0.20     0.20 
R^2                     0.16     0.16 
LogLoss                 0.59     0.59 
Mean Per-Class Error    0.36     0.38 
Gini                    0.44     0.44 
AIC                   457.93   459.51 
======================================

Other slots that can be used with createTexreg

The createTexreg constructor function is called within any extract method. The example above shows some simple pieces of information. Besides the details contained in the example, the following slots are available in texreg objects:

  • se: standard errors
  • pvalues: the p-values
  • ci.low: lower bound of a confidence interval
  • ci.up: upper bound of a confidence interval
  • model.name: the title of the current model

Note that either confidence intervals or standard errors and p-values should be used, not both.

Manipulating texreg objects

Besides handing over the models to the screenreg, texreg, or htmlreg functions directly, it is possible to save the extracted information to a texreg object first and manipulate it before the table is displayed or saved. The added value is that even complex changes to a table are easy to apply this way. For example, it is possible to rename coefficients or GOF statistics, add new rows, rename a model, modify the values, or change the order of coefficients or GOF statistics. Here is how you can do that: First, you call the extract function to save the information to a texreg object:

tr <- extract(model.output.1)

You can display the contents of the texreg object by just calling tr, which shows the following output:

No standard errors and p-values were defined for this texreg object.

                coef.
Intercept -1.83522343
RACE      -0.62522179
DCAPS      1.31442834
PSA        0.04686106

                             GOF dec. places
MSE                    0.2029470        TRUE
R^2                    0.1562137        TRUE
LogLoss                0.5920097        TRUE
Mean Per-Class Error   0.3612191        TRUE
AUC                    0.7185655        TRUE
Gini                   0.4371311        TRUE
Null Deviance        512.2888402        TRUE
Residual Deviance    449.9273825        TRUE
AIC                  457.9273825        TRUE

Alternatively, this is the structure of the object as shown by str(tr):

Formal class 'texreg' [package "texreg"] with 10 slots
  ..@ coef.names : chr [1:4] "Intercept" "RACE" "DCAPS" "PSA"
  ..@ coef       : num [1:4] -1.8352 -0.6252 1.3144 0.0469
  ..@ se         : num(0) 
  ..@ pvalues    : num(0) 
  ..@ ci.low     : num(0) 
  ..@ ci.up      : num(0) 
  ..@ gof.names  : chr [1:9] "MSE" "R^2" "LogLoss" "Mean Per-Class Error" ...
  ..@ gof        : num [1:9] 0.203 0.156 0.592 0.361 0.719 ...
  ..@ gof.decimal: logi [1:9] TRUE TRUE TRUE TRUE TRUE TRUE ...
  ..@ model.name : chr(0) 

Now you can just manipulate this object in arbitrary ways, e.g., add a GOF statistic:

tr@gof.names <- c(tr@gof.names, "new statistic")
tr@gof <- c(tr@gof, 12)
tr@gof.decimal <- c(tr@gof.decimal, FALSE)

Or you can change the order of coefficients:

tr@coef.names <- tr@coef.names[c(4, 1, 2, 3)]
tr@coef <- tr@coef[c(4, 1, 2, 3)]

When you are done with the manipulations, you can hand over the texreg object instead of the original model when you call, for example, screenreg:

screenreg(list(tr, model.output.2), custom.note = "")

The new result will look like this:

======================================
                      Model 1  Model 2
--------------------------------------
PSA                     0.05     0.05 
Intercept              -1.84    -1.11 
RACE                   -0.63    -0.62 
DCAPS                   1.31     1.31 
AGE                             -0.01 
--------------------------------------
MSE                     0.20     0.20 
R^2                     0.16     0.16 
LogLoss                 0.59     0.59 
Mean Per-Class Error    0.36     0.38 
AUC                     0.72     0.72 
Gini                    0.44     0.44 
Null Deviance         512.29   512.29 
Residual Deviance     449.93   449.51 
AIC                   457.93   459.51 
new statistic          12             
======================================

TL;DR

texreg can be customized by users. Just write an extract method like the one shown above and register it using a setMethods call. I have included the H2OBinomialModel method in the latest texreg version 1.36.13, along with a bugfix for using models without standard errors (such as this one).



回答2:

you can use the R xtable package with h2o's H2OTable (or knitr if you convert the H2OTable to an H2OFrame using as.h2o(your_H2OTable) ), if you extract them from the model output.

for example to create a beautiful table from a model's coefficients you would need to first extract the coefficients table with model.output.1@model$coefficients_table, then you can use xtable: xtable(prostate.glm@model$coefficients_table) to print out the Latex code.

for side by side views there are multiple posts on how to do this in knitr or xtable, or xtable and sweave



回答3:

No, there is not a package that does this currently. The broom package doesn't support H2O models yet -- that would be cool! Maybe that could happen in the future. Once there is a way to "tidy" model output into an R data.frame using broom or similar functionality, then xtable, etc. will work nicely.