R dplyr operate on a column known only by its stri

2019-01-20 05:48发布

I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.

My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.

For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.

Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.

userInput_prob1 <- 0.95
userInput_prob2 <- 0.9

# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){

  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff*100)

  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

  # Filter mtcars with only MPG values above cutoffs.
  output <- mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck

  # Return filtered data.
  return(output)
}

best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )

The dplyr::filter() statement is what I can't get to run properly. I've tried:

dplyr::filter( mpg > probColName ) - No error, but no rows returned.

dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.

I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.

Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.

2条回答
做自己的国王
2楼-- · 2019-01-20 06:09

Here's an alternate solution from Hadley's comment in the post referred to in MrFlick's answer (https://github.com/tidyverse/rlang/issues/116). Using as.name() from base R takes the place of rlang::sym(), and you still do need to unquote it. That is, the following also works:

dplyr::filter( mpg > !!as.name(probColName) )

查看更多
Anthone
3楼-- · 2019-01-20 06:11

If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can covert it with rlang::sym(). Just change

dplyr::filter( mpg > !!rlang::sym(probColName) )

and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116

It's still fine to use

dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

because when dynamically setting a parameter name, you just need the string and not an unqouted symbol.

查看更多
登录 后发表回答