I have the following data frame
x <- read.table(text = " id1 id2 val1 val2
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8", header = TRUE)
I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:
# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)
# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)
In order to do both calculations in one call, I tried
do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))
However, I get a garbled output along with a warning:
# m n
# id1 1 2
# id2 1 1
# 1.5 2
# 2 2
# 3.5 2
# 3 2
# 6.5 2
# 8 2
# 7 2
# 6 2
# Warning message:
# In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
# number of columns of result is not a multiple of vector length (arg 1)
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
How can I use `aggregate` or other functions to perform several calculations in one call?
You could add a `count` column, aggregate with `sum`, then scale back to get the `mean`.
It has the advantage of preserving your column names and creating a single `count` column.
Perhaps you want to merge?
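The two suggestions above (summing a helper `count` column and then dividing, or merging two separate aggregates) could be sketched roughly as follows. This is a minimal illustration on the question's data, not the answerers' original code; the names `xc`, `agg`, `means`, `counts` and `merged` are made up for the example.

```r
# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Approach 1: add a count column, sum everything, then divide sums by counts
xc  <- transform(x, count = 1)                # hypothetical helper copy of x
agg <- aggregate(. ~ id1 + id2, data = xc, FUN = sum)
agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg

# Approach 2: compute means and counts separately, then merge on the ids
means  <- aggregate(. ~ id1 + id2, data = x, FUN = mean)
counts <- aggregate(val1 ~ id1 + id2, data = x, FUN = length)
names(counts)[3] <- "n"                       # the length of val1 is the row count
merged <- merge(means, counts, by = c("id1", "id2"))
merged
```

Both results keep `val1` and `val2` as the mean columns and add a single count column (`count` or `n`).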
Another `dplyr` option is `across()`, which is part of the current dev version.
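A minimal sketch of what an `across()` call might look like on the question's data, assuming dplyr 1.0.0 or later is installed (the release in which `across()` and the `.groups` argument became available). The result name `res` is just for illustration.

```r
library(dplyr)

# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Apply a named list of functions to each value column within each group;
# result columns are named <column>_<function>, e.g. val1_mean and val1_n
res <- x %>%
  group_by(id1, id2) %>%
  summarise(across(c(val1, val2), list(mean = mean, n = length)),
            .groups = "drop")
res
```

Note that the row count is repeated once per value column here (`val1_n`, `val2_n`) rather than appearing as a single column.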
You can do it all in one step and get proper labeling:
This creates a dataframe with two id columns and two matrix columns:
As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using `do.call(data.frame, ...)`.
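The one-step call and the flattening step described above could look something like the sketch below. It assumes the aggregating function returns a named vector, which is what makes `aggregate` produce matrix columns; the names `agg` and `flat` are illustrative.

```r
# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Return a named vector per group: aggregate() stores it as a matrix column
agg <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))
str(agg)    # id1, id2, plus val1 and val2 as 4 x 2 matrices

# Expand the matrix columns into ordinary columns
flat <- do.call(data.frame, agg)
names(flat) # val1.mn, val1.n, val2.mn, val2.n after the two id columns
```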
This is the syntax for multiple variables on the LHS:
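As a rough sketch of that syntax: wrapping the value columns in `cbind()` on the left-hand side names them explicitly instead of relying on `.`. The per-group function is the same illustrative one as before, and `agg` is a made-up name.

```r
# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# cbind(val1, val2) on the LHS selects exactly the columns to summarise
agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))
agg
```

This form is handy when the data frame holds other columns that should not be aggregated.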
Using the `dplyr` package you could achieve this with `summarise_all`. With this summarise function you can apply other functions (in this case `mean` and `n()`) to each of the non-grouping columns.
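A sketch of how this might look with a recent dplyr. Two assumptions: the functions are passed as a named list because the older `funs()` helper is deprecated, and `length` stands in for `n()` as the row counter. A `summarise_at()` variant restricted to chosen columns is included as well; all result names are illustrative.

```r
library(dplyr)

# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Apply both functions to every non-grouping column
res_all <- x %>%
  group_by(id1, id2) %>%
  summarise_all(list(mean = mean, n = length))

# Inclusion: only summarise the columns named in vars()
res_incl <- x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), list(mean = mean, n = length))

# Exclusion: summarise everything except val2
res_excl <- x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), list(mean = mean, n = length))

res_all
```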
If you don't want to apply the function(s) to all non-grouping columns, use the `summarise_at()` function and either specify the columns to which they should be applied or exclude the unwanted ones with a minus.
Given this in the question:
Then in `data.table` (1.9.4+) you could try:
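One possible shape for such a `data.table` call is sketched below. It is an illustration on the question's data rather than the answer's original code; `DT` and `res` are made-up names, and the `.()` alias in `by` assumes a reasonably recent data.table.

```r
library(data.table)

# Question data, rebuilt with data.frame() so the block is self-contained
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

DT <- as.data.table(x)

# .N is the row count per group; .SD holds the non-grouping columns,
# so lapply(.SD, mean) gives one mean per value column
res <- DT[, c(list(n = .N), lapply(.SD, mean)), by = .(id1, id2)]
res
```

The result has a single count column `n` next to the per-group means of `val1` and `val2`.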
aggregate
(used in question and all 3 other answers) todata.table
see this benchmark (theagg
andagg.x
cases).