I need to update a regression model from inside a function. Ideally, the function should work with any kind of models (lm
, glm
, multinom
, clm
). More precisely, I need to add one or several covariates that are defined inside the function. Here is an exemple.
MyUpdate <- function(model){
randData <- data.frame(var1=rnorm(length(model$residuals)))
model2 <- update(model, ".~.+randData$var1")
return(model2)
}
Here is an example use
data(iris)
model1 <- lm(Sepal.Length~Species, data=iris)
model2 <- MyUpdate(model1)
Error in eval(expr, envir, enclos) : object 'randData' not found
Here is another example with glm
model1 <- glm(Sepal.Length>5~Species, data=iris, family=binomial)
model2 <- MyUpdate(model1)
Any idea?
The problem is that var1
is looked up in the data frame and the model's environment but not within the environment in MyUpdate
.
1) To avoid this problem update the model with not only the revised formula but also a revised data frame containing var1
:
MyUpdate <- function(model) {
mf <- model.frame(model)
n <- nrow(mf)
var1 <- rnorm(n)
update(model, formula = . ~ . + var1, data = data.frame(mf, var1))
}
The above is probably the best solution of the ones presented in this answer as it avoids mucking around with internal structures. It seems to work for lm
, glm
, multinom
and clm
. The other solutions below do muck around with internal structures and therefore are less general across model fitting routines. The others all work with lm
but may not work for others.
test Here is a test which runs without errors on each of the model fitting functions mentioned in the question if MyUpdate
is as above and also the solutions in (2) all run the tests without error. The solution (3) works at least with lm
.
model.lm <- lm(Sepal.Length~Species, data=iris)
MyUpdate(model.lm)
model.glm <- glm(Sepal.Length~Species, data=iris)
MyUpdate(model.glm)
library(nnet)
example(multinom)
MyUpdate(bwt.mu)
library(ordinal)
model.clm <- clm(rating ~ temp * contact, data = wine)
MyUpdate(model.clm)
The remaining solutions perform more direct access of internals making them less robust to changing the model function.
2) Messing with Environments
In addition here are three solutions that involve messing with environments. The first is the cleanest followed by the second and then the third. The third is the least acceptable since it actually writes var1
into the model's environment (dangerously overwriting any var1
there) but it is the shortest. They work with lm
, glm
multinom
and clm
.
Note that we do not really need to put var1
into a data frame nor is it necessary to put the updating formula in quotes and we have changed both in all examples below. Also the return
statement can be removed and we have done that too.
2a) The following modifies the environment of the original model to point to a new proxy proto object containing var1
whose parent is the original model environment. Here proto(p, var1 = rnorm(n))
is the proxy proto object (a proto object is an environment with differing semantics) and p
is the parent of the proxy.
library(proto)
MyUpdate <- function(model){
mf <- model.frame(model)
n <- nrow(mf)
var1 <- rnorm(n)
p <- environment(formula(model))
if (is.null(model$formula)) {
attr(model$terms, ".Environment") <- proto(p, var1 = var1)
} else environment(model$formula) <- proto(p, var1 = var1)
update(model, . ~ . + var1)
}
For more information read the Proxies section in this document: http://r-proto.googlecode.com/files/prototype_approaches.pdf
2b) This could alternately be done without proto but at the expense of expanding the ## line to three lines containing some additional ugly environment manipulations. Here e
is the proxy environment.
MyUpdate <- function(model){
mf <- model.frame(model)
n <- nrow(mf)
var1 <- rnorm(n)
p <- environment(formula(model))
e <- new.env(parent = p)
e$var1 <- var1
if (is.null(model$formula)) attr(model$terms, ".Environment") <- e
else environment(model$formula) <- e
update(model, . ~ . + var1)
}
2c) Shortest but the most hackish is to stick var1
into the original model
environment:
MyUpdate <- function(model){
mf <- model.frame(model)
n <- nrow(mf)
var1 <- rnorm(n)
if (is.null(model$formula)) attr(model$terms, ".Environment")$var1 <- var1
else environment(model$formula)$var1 <- var1
update(model, . ~ . + var1)
}
3) eval/substitute This solution does use eval
which is sometimes frowned upon. It works on lm
and glm
and on clm
it works except that the output does not display var1
but rather the expression that computes it.
MyUpdate <- function(model) {
m <- eval.parent(substitute(update(model, . ~ . + rnorm(nrow(model.frame(model))))))
m$call$formula <- update(formula(model), . ~ . + var1)
names(m$coefficients)[length(m$coefficient)] <- "var1"
m
}
REVISED Added additional solutions, simplified (1), got solutions in (2) to run all examples in test section.
Some theory. A formula object often has an associated environment:
frm1 <- y~x # a formula created in the global environment ("in the console")
attr(frm1, ".Environment") # see also unclass(frm1)
## <environment: R_GlobalEnv>
Here, the functions acting on frm1
will know that they shall seek for y
and x
in the global environment (unless stated otherwise, see e.g. data
arg of lm()
). On the other hand:
f <- function() { y~x }; frm2 <- f() # a formula created in a function
attr(frm2, ".Environment")
## <environment: 0x2f87e48>
Such a formula points to y
and x
being the "local variables" in f()
.
If you pass a formula created in the global environment to a function, you will is most cases not be able to refer to the objects created within that function.
The solution. The underlying formula and environment is somewhat "hidden" withing the object returned by lm()
. But they can be accessed. The code below should solve your issue.
MyUpdate <- function(model){
assign("randData", data.frame(var1=rnorm(length(model$residuals))),
envir=attr(model1$terms, ".Environment"))
model2 <- update(model, ".~.+randData$var1")
return(model2)
}