update() a model inside a function with local cova

2019-06-16 15:45发布

问题:

I need to update a regression model from inside a function. Ideally, the function should work with any kind of models (lm, glm, multinom, clm). More precisely, I need to add one or several covariates that are defined inside the function. Here is an exemple.

MyUpdate <- function(model){
     randData <- data.frame(var1=rnorm(length(model$residuals)))
     model2 <- update(model, ".~.+randData$var1")
     return(model2)
}

Here is an example use

data(iris)
model1 <- lm(Sepal.Length~Species, data=iris)
model2 <- MyUpdate(model1)

Error in eval(expr, envir, enclos) : object 'randData' not found

Here is another example with glm

model1 <- glm(Sepal.Length>5~Species, data=iris, family=binomial)
model2 <- MyUpdate(model1)

Any idea?

回答1:

The problem is that var1 is looked up in the data frame and the model's environment but not within the environment in MyUpdate.

1) To avoid this problem update the model with not only the revised formula but also a revised data frame containing var1 :

MyUpdate <- function(model) {
     mf <- model.frame(model)
     n <- nrow(mf)
     var1 <- rnorm(n)
     update(model, formula = . ~ . + var1, data = data.frame(mf, var1))
}

The above is probably the best solution of the ones presented in this answer as it avoids mucking around with internal structures. It seems to work for lm, glm, multinom and clm. The other solutions below do muck around with internal structures and therefore are less general across model fitting routines. The others all work with lm but may not work for others.

test Here is a test which runs without errors on each of the model fitting functions mentioned in the question if MyUpdate is as above and also the solutions in (2) all run the tests without error. The solution (3) works at least with lm.

model.lm <- lm(Sepal.Length~Species, data=iris)
MyUpdate(model.lm)

model.glm <- glm(Sepal.Length~Species, data=iris)
MyUpdate(model.glm)

library(nnet)
example(multinom)
MyUpdate(bwt.mu)

library(ordinal)
model.clm <- clm(rating ~ temp * contact, data = wine)
MyUpdate(model.clm)

The remaining solutions perform more direct access of internals making them less robust to changing the model function.

2) Messing with Environments

In addition here are three solutions that involve messing with environments. The first is the cleanest followed by the second and then the third. The third is the least acceptable since it actually writes var1 into the model's environment (dangerously overwriting any var1 there) but it is the shortest. They work with lm, glm multinom and clm.

Note that we do not really need to put var1 into a data frame nor is it necessary to put the updating formula in quotes and we have changed both in all examples below. Also the return statement can be removed and we have done that too.

2a) The following modifies the environment of the original model to point to a new proxy proto object containing var1 whose parent is the original model environment. Here proto(p, var1 = rnorm(n)) is the proxy proto object (a proto object is an environment with differing semantics) and p is the parent of the proxy.

library(proto)

MyUpdate <- function(model){

     mf <- model.frame(model)
     n <- nrow(mf)
     var1 <- rnorm(n)
     p <- environment(formula(model))

     if (is.null(model$formula)) {
           attr(model$terms, ".Environment") <- proto(p, var1 = var1)
     } else environment(model$formula) <- proto(p, var1 = var1)

     update(model, . ~ . + var1)
}

For more information read the Proxies section in this document: http://r-proto.googlecode.com/files/prototype_approaches.pdf

2b) This could alternately be done without proto but at the expense of expanding the ## line to three lines containing some additional ugly environment manipulations. Here e is the proxy environment.

MyUpdate <- function(model){
     mf <- model.frame(model)
     n <- nrow(mf)
     var1 <- rnorm(n)
     p <- environment(formula(model))

     e <- new.env(parent = p)
     e$var1 <- var1

     if (is.null(model$formula)) attr(model$terms, ".Environment") <- e
     else environment(model$formula) <- e

     update(model, . ~ . + var1)
}

2c) Shortest but the most hackish is to stick var1 into the original model environment:

MyUpdate <- function(model){
     mf <- model.frame(model)
     n <- nrow(mf)
     var1 <- rnorm(n)       

     if (is.null(model$formula)) attr(model$terms, ".Environment")$var1 <- var1
     else environment(model$formula)$var1 <- var1

     update(model, . ~ . + var1)
}

3) eval/substitute This solution does use eval which is sometimes frowned upon. It works on lm and glm and on clm it works except that the output does not display var1 but rather the expression that computes it.

MyUpdate <- function(model) {
     m <- eval.parent(substitute(update(model, . ~ . + rnorm(nrow(model.frame(model))))))
     m$call$formula <- update(formula(model), . ~ . + var1)
     names(m$coefficients)[length(m$coefficient)] <- "var1"
     m
}

REVISED Added additional solutions, simplified (1), got solutions in (2) to run all examples in test section.



回答2:

Some theory. A formula object often has an associated environment:

frm1 <- y~x # a formula created in the global environment ("in the console")
attr(frm1, ".Environment") # see also unclass(frm1)
## <environment: R_GlobalEnv>

Here, the functions acting on frm1 will know that they shall seek for y and x in the global environment (unless stated otherwise, see e.g. data arg of lm()). On the other hand:

f <- function() { y~x }; frm2 <- f() # a formula created in a function
attr(frm2, ".Environment")
## <environment: 0x2f87e48>

Such a formula points to y and x being the "local variables" in f().

If you pass a formula created in the global environment to a function, you will is most cases not be able to refer to the objects created within that function.

The solution. The underlying formula and environment is somewhat "hidden" withing the object returned by lm(). But they can be accessed. The code below should solve your issue.

MyUpdate <- function(model){
     assign("randData", data.frame(var1=rnorm(length(model$residuals))),
        envir=attr(model1$terms, ".Environment"))
     model2 <- update(model, ".~.+randData$var1")
     return(model2)
}


标签: r glm lm