Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw "First".
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
I find
ave
very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick on base R) :e.g.
Given this input :
we want to group by
Categ1
andCateg2
and compute the sum ofSamples
and mean ofFreq
.Here's a possible solution using
ave
:Result :
More recently, you can also use the dplyr package for that purpose:
Or, for multiple summary columns (works with one column too):
Update for dplyr >= 0.5:
summarise_each
has been replaced bysummarise_all
,summarise_at
andsummarise_if
family of functions in dplyr.Or, if you have multiple columns to group by, you can specify all of them in the
group_by
separated with commas:For more information, including the
%>%
operator, see the introduction to dplyr.While I have recently become a convert to
dplyr
for most of these types of operations, thesqldf
package is still really nice (and IMHO more readable) for some things.Here is an example of how this question can be answered with
sqldf
This is somewhat related to this question.
You can also just use the by() function:
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
The recently added
dplyr::tally()
now makes this easier than ever:The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:
Let's compare that to the same thing using data.frame and the above above:
And if you want to keep the column this is the syntax:
The difference will become more noticeable with larger datasets, as the code below demonstrates:
For multiple aggregations, you can combine
lapply
and.SD
as follows