I am wrestling with programming using dplyr
in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr
to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.
My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr
within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.
For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.
Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs
to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff
at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName
, which follows a predefined pattern based on the value of pctCutoff
.
userInput_prob1 <- 0.95
userInput_prob2 <- 0.9
# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){
# Define new column name to hold the MPG percentile cutoff.
probColName <- paste0('P', pctCutoff*100)
# Compute the MPG percentile cutoff by number of gears.
MPGCutoffs <- mtcars %>%
dplyr::group_by( gear ) %>%
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
# Filter mtcars with only MPG values above cutoffs.
output <- mtcars %>%
dplyr::left_join( MPGCutoffs, by='gear' ) %>%
dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck
# Return filtered data.
return(output)
}
best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )
The dplyr::filter()
statement is what I can't get to run properly. I've tried:
dplyr::filter( mpg > probColName )
- No error, but no rows returned.
dplyr::filter( mpg > !!probColName )
- No error, but no rows returned.
I've also seen examples where I could pass something like quo(P95)
to the function and then unquote it in the call to dplyr::filter()
; I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter()
fails because the column created is named P90
and not P95
.
Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.