I have a data frame with about 200 columns. I want to group the table by the first 10 or so, which are factors, and sum the rest of the columns.
I have a list of all the column names I want to group by and a list of all the columns I want to aggregate.
The output I am looking for should be the same data frame with the same number of columns, just grouped together.
Is there a solution using data.table, plyr, or any other package?
The dplyr way would be to group_by the grouping columns and then apply sum to the remaining columns with summarise_each.
You can further specify the columns to be summarised or excluded from the summarise_each by using the special functions mentioned in the help file of ?dplyr::select.
Another way to do this with dplyr that would be generic (it doesn't need the list of columns) would be:
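Both dplyr variants can be sketched as below. This is only an illustration under assumptions: the data frame df and the vectors group_cols and sum_cols are made up for the example, and because summarise_each has since been deprecated in dplyr, the sketch uses across() instead, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical example data: two grouping factors and two numeric columns.
df <- data.frame(
  g1 = factor(c("a", "a", "b", "b")),
  g2 = factor(c("x", "x", "y", "z")),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)

group_cols <- c("g1", "g2")  # assumed list of columns to group by
sum_cols   <- c("v1", "v2")  # assumed list of columns to sum

# Variant 1: use the explicit lists of column names.
by_list <- df %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(across(all_of(sum_cols), sum), .groups = "drop")

# Variant 2: generic, no column lists needed.
# Group by every factor column and sum every numeric column.
by_type <- df %>%
  group_by(across(where(is.factor))) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")
```

Both results keep the grouping columns and the summed columns, with one row per group.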
The data.table way is:
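A minimal sketch of the data.table approach, assuming a made-up table dt and hypothetical name vectors group_cols and sum_cols. It shows two forms: summing everything outside the grouping columns, and restricting the summed columns with .SDcols.

```r
library(data.table)

# Hypothetical example data: two grouping columns and two numeric columns.
dt <- data.table(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "x", "y", "z"),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)

group_cols <- c("g1", "g2")  # assumed list of columns to group by
sum_cols   <- c("v1", "v2")  # assumed list of columns to sum

# Form 1: sum every column that is not a grouping column.
res_all <- dt[, lapply(.SD, sum), by = group_cols]

# Form 2: sum only the columns named in sum_cols.
res_sel <- dt[, lapply(.SD, sum), by = group_cols, .SDcols = sum_cols]
```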
In either form, .SD is the (S)ubset of (D)ata excluding group columns. (Aside: If you need to refer to group columns generically, they are in .BY.)

In base R this would be...
EDIT: The aggregate function has come a long way since I wrote this. None of the casting above is necessary.
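For the base R route, here is a rough sketch with aggregate, not necessarily the code originally posted. The data frame df and the vectors group_cols and sum_cols are assumptions made for the example. It shows the list interface and a formula built with paste and formula.

```r
# Hypothetical example data: two grouping factors and two numeric columns.
df <- data.frame(
  g1 = factor(c("a", "a", "b", "b")),
  g2 = factor(c("x", "x", "y", "z")),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)

group_cols <- c("g1", "g2")  # assumed list of columns to group by
sum_cols   <- c("v1", "v2")  # assumed list of columns to sum

# List interface: pass the value columns and the grouping columns directly.
res_list <- aggregate(df[sum_cols], by = df[group_cols], FUN = sum)

# Formula interface: build ". ~ g1 + g2" from the grouping names.
# The dot stands for every column of df not named on the right-hand side.
f <- formula(paste(". ~", paste(group_cols, collapse = " + ")))
res_formula <- aggregate(f, data = df, FUN = sum)
```

Note that the formula method drops rows containing NA by default (na.action = na.omit), so the two interfaces can differ on data with missing values.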
And there are a variety of ways to write this. Assuming the first 10 columns are named a1 through a10, I like writing the formula out in full, even though it is verbose. (You could use paste to construct the formula and use formula.)

This seems like a task for ddply (I use the 'baseball' dataset which is included with plyr):
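A minimal sketch of such a ddply call on the baseball data. The particular columns chosen for groupColumns and dataColumns are assumptions for illustration, and na.rm = TRUE is added here because some baseball columns contain missing values.

```r
library(plyr)

# Assumed choice of columns from plyr's baseball dataset.
groupColumns <- c("year", "team")
dataColumns  <- c("hr", "rbi", "sb")

# For each year/team piece, sum the selected data columns.
res <- ddply(baseball, groupColumns,
             function(x) colSums(x[dataColumns], na.rm = TRUE))

head(res)
```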
This gives, for each combination of groupColumns, the sum of the columns specified in dataColumns.
Using plyr::ddply:
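One possible form of that call, sketched with plyr::numcolwise so that no list of value columns is needed. The data frame df and the vector group_cols are hypothetical, and this is not necessarily the snippet originally posted.

```r
library(plyr)

# Hypothetical example data: two grouping factors and two numeric columns.
df <- data.frame(
  g1 = factor(c("a", "a", "b", "b")),
  g2 = factor(c("x", "x", "y", "z")),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)

group_cols <- c("g1", "g2")  # assumed list of columns to group by

# numcolwise(sum) applies sum to every numeric column of each piece,
# so the factor grouping columns are left out of the summation.
res_plyr <- ddply(df, group_cols, numcolwise(sum))
```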