Pass a data.frame column name to a function

2018-12-31 05:23发布

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.

The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant

  1. call to substitute() and possibly eval()
  2. the need to pass the column name as a character vector.

fun1 <- function(x, column){
  do.call("max", list(substitute(x[a], list(a = column))))
}

fun2 <- function(x, column){
  max(eval((substitute(x[a], list(a = column)))))
}

df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")

I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:

  • Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
  • with(x, get(column)), but, even if it works, I think this would still require substitute
  • Make use of formula() and match.call(), neither of which I have much experience with.

Subquestion: Is do.call() preferred over eval()?

3条回答
低头抚发
2楼-- · 2018-12-31 05:50

You can just use the column name directly:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))

There's no need to use substitute, eval, etc.

You can even pass the desired function as a parameter:

fun1 <- function(x, column, fn) {
  fn(x[,column])
}
fun1(df, "B", max)

Alternatively, using [[ also works for selecting a single column at a time:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[[column]])
}
fun1(df, "B")
查看更多
浪荡孟婆
3楼-- · 2018-12-31 05:59

Personally I think that passing the column as a string is pretty ugly. I like to do something like:

get.max <- function(column,data=NULL){
    column<-eval(substitute(column),data, parent.frame())
    max(column)
}

which will yield:

> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5

Notice how the specification of a data.frame is optional. you can even work with functions of your columns:

> get.max(1/mpg,mtcars)
[1] 0.09615385
查看更多
裙下三千臣
4楼-- · 2018-12-31 06:05

This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.

Suppose we have a very simple data frame:

dat <- data.frame(x = 1:4,
                  y = 5:8)

and we'd like to write a function that creates a new column z that is the sum of columns x and y.

A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:

foo <- function(df,col_name,col1,col2){
      df$col_name <- df$col1 + df$col2
      df
}

#Call foo() like this:    
foo(dat,z,x,y)

The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".

The simplest, and most often recommended solution is simply switch from $ to [[ and pass the function arguments as strings:

new_column1 <- function(df,col_name,col1,col2){
    #Create new column col_name as sum of col1 and col2
    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column1(dat,"z","x","y")
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.

The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.

If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):

new_column2 <- function(df,col_name,col1,col2){
    col_name <- deparse(substitute(col_name))
    col1 <- deparse(substitute(col1))
    col2 <- deparse(substitute(col2))

    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column2(dat,z,x,y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.

Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:

new_column3 <- function(df,col_name,expr){
    col_name <- deparse(substitute(col_name))
    df[[col_name]] <- eval(substitute(expr),df,parent.frame())
    df
}

Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:

> new_column3(dat,z,x+y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
  x y  z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
  x y  z
1 1 5  5
2 2 6 12
3 3 7 21
4 4 8 32

So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.

查看更多
登录 后发表回答