I have a data.frame such as this (the real data set has many more rows and columns)
set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))
# id X1 X2 X3 X4 X5
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125
I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).
For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is
(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768
but I wanted to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()
's helper functions such as starts_with
, ends_with
, matches
, etc and any function. Here's what I tried
library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute_(.dots=lazyeval::lazy_dots(...)) %>%
cbind(x,.)
}
dd %>% partial_mutate1(num_range("X", c(2,4)),
evensum=rowSums(.), evenmean=rowMeans(.))
However, This throws an error that says
Error in rowSums(.) : 'x' must be numeric
Which appears to be because .
seems to be referring to the entire date.frame rather than the selected subset. (same error as rowSums(dd)
). However, note that this produces the desired output
partial_mutate2 <- function(x, colspec) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>%
cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))
I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1
so that the .
will correctly take values from the "select()-ed" dataset?
Am I missing something or would this work as expected:
Or are you specifically looking for a custom function to do this?
Not exactly what you are looking for but if you want to do it inside a pipe you could use
select
explicitly insidemutate
like this:However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (..not thoroughly tested.. ):
And then apply it like this:
A number-of-columns agnostic approach using dplyr:
In newer versions of dplyr, you can use the new
mutate_at()
function
tidyr::nest()
understands the same selector syntax asdplyr::select()
, so one approach would be to consolidate the columns of interest into a single column-of-dataframes, perform the necessary operations on that column-of-dataframes, and unnest to get back a flat data frame:Since data frames are basically lists, this approach is naturally suited for applying arbitrary functions (such as
sd
above) to arbitrary an set of columns usingpurrr::pmap()
family of functions.Side note: Since
sd
works on vectors, we usepurrr::lift_vd
to convert its interface to be suitable forpmap
: