Every time I do an aggregate on a data.frame I default to using the "by = list(...)" parameter. But I do see solutions on Stack Overflow and elsewhere where a tilde (~) is used in the "formula" parameter. I kind of see the "by" parameter as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
AND
aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)
What is the difference between the two and when do you use which?
From the help page,
So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.
I would not entirely disagree that it doesn't really matter which approach you use; however, it is important to note that they do behave differently.
I'll illustrate with a small example.
Here's some sample data:
First, the formula interface. The following three commands will all yield the same output.
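As a rough sketch of what three equivalent formula calls can look like: the data frame `mydf` and its columns `A`, `B`, `v1` and `v2` below are made up for illustration and are not the data from the original answer.

```r
# Hypothetical stand-in data (not the original sample data)
mydf <- data.frame(
  A  = c("a", "a", "a", "b", "b", "b"),
  B  = c("x", "y", "x", "y", "x", "y"),
  v1 = c(1, 2, 3, 4, 5, 6),
  v2 = c(10, 20, 30, 40, 50, 60)
)

# 1. Dot on the left: aggregate every column not used for grouping
out1 <- aggregate(. ~ A + B, mydf, FUN = sum)

# 2. Explicit cbind() on the left: name the columns to aggregate
out2 <- aggregate(cbind(v1, v2) ~ A + B, mydf, FUN = sum)

# 3. Dot on the right: group by every column not named on the left
out3 <- aggregate(cbind(v1, v2) ~ ., mydf, FUN = sum)

print(out1)
```

All three should return one row per combination of `A` and `B`, with the grouping columns and the summed columns keeping their original names.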
Here's a related command for the "by" interface. It's pretty cumbersome to type (but that can be addressed by using with(), if required).
Now, stop and make note of any differences.
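For comparison, a "by" call and its with() shorthand might look like the sketch below. It again uses a hypothetical `mydf` with made-up columns, not the original sample data.

```r
# Hypothetical stand-in data (not the original sample data)
mydf <- data.frame(
  A  = c("a", "a", "a", "b", "b", "b"),
  B  = c("x", "y", "x", "y", "x", "y"),
  v1 = c(1, 2, 3, 4, 5, 6),
  v2 = c(10, 20, 30, 40, 50, 60)
)

# data.frame method: every column has to be spelled out with mydf$...
out_by <- aggregate(mydf[c("v1", "v2")],
                    by = list(mydf$A, mydf$B),
                    FUN = sum)

# Same call wrapped in with(), which saves the repeated mydf$ prefix
out_with <- with(mydf, aggregate(cbind(v1, v2), by = list(A, B), FUN = sum))

# The unnamed list passed to 'by' yields generic grouping column names
print(names(out_by))
```

The sums match the formula interface, but the grouping columns come back as `Group.1` and `Group.2` because the list given to `by` is unnamed.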
The two that pop into my mind are:
1. The formula method does a nicer job of preserving names, but it doesn't let you control the names directly in your command, which you can do in the data.frame method.
2. The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.

Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
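The two differences can be checked with a small sketch. The data frame `mydf`, its columns, and the group name `group` are hypothetical and chosen only to make the effect visible; one `NA` is placed in `v1`.

```r
# Hypothetical stand-in data with one missing value in v1
mydf <- data.frame(
  A  = c("a", "a", "b", "b"),
  v1 = c(1, NA, 3, 4),
  v2 = c(10, 20, 30, 40)
)

# Names: the data.frame method lets you name the grouping column in the call
by_named <- aggregate(mydf[c("v1", "v2")], by = list(group = mydf$A), FUN = sum)
print(names(by_named))

# NA handling, formula method: the default na.action = na.omit drops the
# whole second row before aggregating, so v2 for group "a" loses the 20
f_default <- aggregate(cbind(v1, v2) ~ A, mydf, FUN = sum)

# NA handling, formula method with na.pass: rows are kept, so sum() sees
# the NA in v1 and returns NA, while v2 is summed over both rows
f_pass <- aggregate(cbind(v1, v2) ~ A, mydf, FUN = sum, na.action = na.pass)

# NA handling, data.frame method: rows are kept, as with na.pass
by_default <- aggregate(mydf[c("v1", "v2")], by = list(A = mydf$A), FUN = sum)

print(f_default)
print(f_pass)
print(by_default)
```

With the default settings the formula call reports complete-case sums for group "a", whereas the na.pass call and the data.frame call both keep the incomplete row.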