Looping over combinations of regression model term

2020-02-13 05:40发布

I'm running a regression in the form

reg=lm(y ~ x1+x2+x3+z1,data=mydata)

In the place of the last term, z1, I want to loop through a set of different variables, z1 through z10, running a regression for each with it as the last term. E.g. in second run I want to use

reg=lm(y ~ x1+x2+x3+z2,data=mydata)

in 3rd run:

reg=lm(y ~ x1+x2+x3+z3,data=mydata)

How can I automate this by looping through the list of z-variables?

标签: r apply lm
4条回答
手持菜刀,她持情操
2楼-- · 2020-02-13 06:20

Depending on what your final goal is, it can be much faster to fit a base model, update it with add1, and extract the F-test/AIC you want:

> basemodel <- lm(y~x1+x2+x3, dat1)
> 
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions

Model:
y ~ x1 + x2 + x3
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              477.34 164.31               
z1      1    0.0768 477.26 166.29  0.0153 0.9019
z2      1    5.1937 472.15 165.21  1.0450 0.3093

See also ?update for refitting the model.

查看更多
再贱就再见
3楼-- · 2020-02-13 06:22

With this dummy data:

dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)

You could get your list of two lm objects this way:

 lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))

Which iterates through those two columns and substitutes them as arguments into the lm call.

As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.

查看更多
一纸荒年 Trace。
4楼-- · 2020-02-13 06:23

Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by() from the dplyr package instead of lapply():

library(dplyr)
library(tidyr)
library(magrittr) # for use_series ()
dat1 %>%
  gather(varname, z, z1:z2) %>% # convert data to long form
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  use_series(model)

This converts the data to a long format using gather, where the z-values occupy the same column. use_series() from the magrittr package return the list of lm objects instead of a data.frame. If you load the broom package, you can extract the model coefficients in this pipeline of code:

library(broom)
dat1 %>%
  gather(varname, z, z1:z2) %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model) # or tidy(model)

Source: local data frame [2 x 12]
Groups: varname

  varname  r.squared adj.r.squared    sigma statistic   p.value df    logLik      AIC      BIC deviance df.residual
1      z1 0.06606736    0.02674388 2.075924  1.680099 0.1609905  5 -212.3698 436.7396 452.3707 409.3987          95
2      z2 0.06518852    0.02582804 2.076900  1.656192 0.1666479  5 -212.4168 436.8337 452.4647 409.7840          95

Data:

dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
                   x2 = runif(100), x3 = runif(100),
                   z1 = runif(100), z2 = runif(100))
查看更多
【Aperson】
5楼-- · 2020-02-13 06:27

While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.

dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))

lapply(colnames(dat1)[5:6],
       function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
       d = dat1)

Rather than looping over the actual columns of the data frame, this loops only over the string of names. This provides some speed improvements as fewer things are copied between iterations.

library(microbenchmark)

microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
#       min       lq     mean   median       uq      max neval
#  4.014237 4.148117 4.323387 4.220189 4.281995 5.898811   100

microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
#       min       lq     mean   median       uq    max neval
#  4.391494 4.505056 5.186972 4.598301 4.698818 51.573   100

The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.

查看更多
登录 后发表回答