I have a data.frame such as this (the real data set has many more rows and columns)
set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))
# id X1 X2 X3 X4 X5
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125
I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).
For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is
(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768
but I wanted to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()
's helper functions such as starts_with
, ends_with
, matches
, etc and any function. Here's what I tried
library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute_(.dots=lazyeval::lazy_dots(...)) %>%
cbind(x,.)
}
dd %>% partial_mutate1(num_range("X", c(2,4)),
evensum=rowSums(.), evenmean=rowMeans(.))
However, This throws an error that says
Error in rowSums(.) : 'x' must be numeric
Which appears to be because .
seems to be referring to the entire date.frame rather than the selected subset. (same error as rowSums(dd)
). However, note that this produces the desired output
partial_mutate2 <- function(x, colspec) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>%
cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))
I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1
so that the .
will correctly take values from the "select()-ed" dataset?
Am I missing something or would this work as expected:
cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768
Or are you specifically looking for a custom function to do this?
Not exactly what you are looking for but if you want to do it inside a pipe you could use select
explicitly inside mutate
like this:
dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
# id X1 X2 X3 X4 X5 xy
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535
However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (..not thoroughly tested.. ):
f <- function(x, ...) {
n <- nrow(x)
x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
matrix(unlist(x), nrow = n, byrow = FALSE)
}
And then apply it like this:
dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
# id X1 X2 X3 X4 X5 xy.1 xy.2
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592
A number-of-columns agnostic approach using dplyr:
dd %>%
select(-id) %>%
mutate(evensum = rowSums(.[,1:length(.[1,])%%2==0]),
evenmean = rowMeans(.[,1:length(.[1,])%%2==0])) %>%
cbind(id=dd[,1],.)
id X1 X2 X3 X4 X5 evensum evenmean
1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380812
2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478767
tidyr::nest()
understands the same selector syntax as dplyr::select()
, so one approach would be to consolidate the columns of interest into a single column-of-dataframes, perform the necessary operations on that column-of-dataframes, and unnest to get back a flat data frame:
library( tidyverse )
dd %>% nest( X2, X4, .key="Slice" ) %>%
mutate( evensum = map(Slice, rowSums),
evenmean = map(Slice, rowMeans),
evensd = map(Slice, pmap_dbl, lift_vd(sd)) ) %>%
unnest
# id X1 X3 X5 evensum evenmean evensd X2 X4
# 1 a 0.602 0.687 0.447 0.876 0.438 0.100 0.367 0.509
# 2 b 0.195 0.831 0.965 1.70 0.848 0.200 0.989 0.707
# 3 c 0.966 0.105 0.141 1.68 0.839 0.0333 0.815 0.862
# 4 d 0.651 0.646 0.777 1.10 0.548 0.416 0.254 0.842
Since data frames are basically lists, this approach is naturally suited for applying arbitrary functions (such as sd
above) to arbitrary an set of columns using purrr::pmap()
family of functions.
Side note: Since sd
works on vectors, we use purrr::lift_vd
to convert its interface to be suitable for pmap
:
sd( c(0.367, 0.509) ) # 0.100
lift_vd(sd)( 0.367, .509 ) # 0.100
In newer versions of dplyr, you can use the new mutate_at()
function
mutate_at(dd, vars(starts_with("X")), somefunction)