I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it and have not figured it out so far ...
For a specified set of categorical factors I want to construct a table of means and variances by group.
generate data:
set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
desired output:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.09537958
2 A a II 0.4876630 0.11079670
3 A a III 0.3102926 0.20280568
4 A b I 0.3914084 0.05869310
5 A b II 0.5257355 0.21863126
6 A b III 0.3356860 0.07943314
... etc. ...
using aggregate/merge:
library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
rename(m2,c(y="y.var")))
using ddply/summarise (possibly best, but I haven't been able to make it work):
mvtab2 <- ddply(subset(d,select=-c(z,rep)),
.(f1,f2,f3),
summarise,numcolwise(mean),numcolwise(var))
This results in:
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from closure to logical) in subassignment type fix
using melt/cast (maybe best?):
mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
id.vars=1:3),
...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables
This won't (?) work in reshape2. Explaining ...~. to someone could be tricky!
I'm a bit puzzled. Does this not work?
This gives me output in the right form, but it looks like the values are different from what you specified.
Edit

Here's how to make your version with numcolwise work: you forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which, as Hadley points out in the comments, shouldn't be relied upon, as it may change in future versions of plyr.)
Here is a solution using data.table:
I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...
aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ... (Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)

(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".
[snipped output to show the latex -> PDF -> png output]
I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:
So the function call is simple, easy to use, and I would say, elegant.
Now, if your primary concern is speed, it seems that summaryBy would be reasonable--at least with smaller-sized tasks (note that I couldn't get the ramnath_datatable function to work, for whatever reason).

I came across this question and found that the benchmarks were done with small tables, so it's hard to tell which method is better with only 100 rows.
I also modified the data a bit to make it "unsorted", which would be a more common case (for example, when the data sit in a database). I added a few more data.table trials to see whether setting a key beforehand is faster. Setting the key beforehand doesn't seem to improve performance much here, so ramnath's solution appears to be the fastest.