R: Evaluate an expression in a data frame with arg

Posted 2020-03-21 10:15

Question:

I want to write a function that evaluates an expression in a data frame, but one that does so using expressions that may or may not contain user-defined objects. I think the magic word is "non-standard evaluation", but I cannot quite figure it out just yet.

One simple example (yet realistic for my purposes): Say, I want to evaluate an lm() call for variables found in a data frame.

mydf <- data.frame(x=1:10, y=1:10)

A function that does so can be written as follows:

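f <- function(df, expr){
  expr <- substitute(expr)  # capture the expression without evaluating it
  pf <- parent.frame()      # the caller's environment, used as the enclosure
  eval(expr, df, pf)        # look up names in df first, then in pf
}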

With it, the following command gives me what I want.

f(mydf, lm(y~x))

# Call:
# lm(formula = y ~ x)
# 
# Coefficients:
# (Intercept)            x  
#    1.12e-15     1.00e+00  

Nice. However, there are cases in which it is more convenient to save the model formula in an object before calling lm(). Unfortunately, the above function then no longer works.

fml <- y~x

f(mydf, lm(fml))
# Error in eval(expr, envir, enclos): object 'y' not found

Can someone explain why the second call doesn't work? How could the function be altered, such that both calls would lead to the desired results? (desired=fitted model)

Cheers!

Answer 1:

From ?lm, re data argument:

If not found in data, the variables are taken from environment(formula)

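In your first case, the formula y~x is created during the eval(expr, df, pf) call, so the environment of the formula is a temporary environment built from df, and lm() finds y and x there. In the second case, the formula is created in the global environment, so environment(fml) is the global environment; lm() searches there for y and x, finds neither, and fails.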
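The difference between the two cases can be checked directly by inspecting the formula environments. This is only a small sketch: it reuses mydf and f from the question, the names inline_env, inline_has_y, fml_env_is_here and stored_call_fails are made up for illustration, and it assumes that no objects called x or y are visible from the calling environment.

```r
mydf <- data.frame(x = 1:10, y = 1:10)

f <- function(df, expr) {
  expr <- substitute(expr)
  eval(expr, df, parent.frame())
}

# Case 1: the formula is created while eval() runs, so its environment is
# the temporary environment that eval() builds from the data frame.
inline_env <- f(mydf, environment(y ~ x))
inline_has_y <- exists("y", envir = inline_env, inherits = FALSE)

# Case 2: the formula is created in the calling code, so its environment is
# wherever this line runs (the global environment at top level).
fml <- y ~ x
fml_env_is_here <- identical(environment(fml), environment())

# lm() looks for y and x in environment(fml), not in the data frame,
# so the stored-formula call errors (assuming no x or y in the caller).
stored_call_fails <- inherits(try(f(mydf, lm(fml)), silent = TRUE), "try-error")

print(c(inline_has_y, fml_env_is_here, stored_call_fails))
```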

Because formulas come with their own environment, they can be tricky to handle in NSE.

You could try:

with(mydf,
  {
    print(lm(y~x))
    fml <- y~x
    print(lm(fml))
  }
)

but that probably isn't ideal for you. Short of checking whether any names in the captured parameter resolve to formulas, and re-assigning their environments, you'll have some trouble. Worse, it isn't even necessarily obvious that re-assigning the environment is the right thing to do. In many cases, you do want to look in the formula environment.
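For illustration, the "check names and re-assign environments" idea could look roughly like the sketch below. It is not a recommended general solution: f2 and data_env are hypothetical names, the helper only handles formulas that are referenced by a bare name in the expression, and, as noted above, replacing a formula's environment means variables that lived only in its original environment may no longer be found.

```r
mydf <- data.frame(x = 1:10, y = 1:10)

# Hypothetical helper (illustrative only, not from the original answer)
f2 <- function(df, expr) {
  expr <- substitute(expr)
  pf <- parent.frame()
  # Environment holding the data frame columns, enclosed by the caller's frame
  data_env <- list2env(as.list(df), parent = pf)
  # For every name in the expression that resolves to a formula in the caller
  # (and is not itself a column of df), store a copy in data_env whose
  # environment is data_env, so lm() looks up the variables there
  for (nm in setdiff(all.vars(expr), names(df))) {
    if (exists(nm, envir = pf)) {
      obj <- get(nm, envir = pf)
      if (inherits(obj, "formula")) {
        environment(obj) <- data_env
        assign(nm, obj, envir = data_env)
      }
    }
  }
  eval(expr, data_env)
}

fml <- y ~ x
fit_inline <- f2(mydf, lm(y ~ x))  # formula written inline
fit_stored <- f2(mydf, lm(fml))    # formula stored in an object
```

If changing the call itself is acceptable, passing the data explicitly, as in lm(fml, data = mydf), avoids the environment question altogether.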

There was a loosely related discussion on this issue on R Chat:

  • Ben Bolker outlines an issue
  • Josh O'Brien points to some old references