I have a data frame with 900,000 rows and 11 columns in R. The column names and types are as follows:
column name: date / mcode / mname / ycode / yname / yissue / bsent / breturn / tsent / treturn / csales
type: Date / Char / Char / Char / Char / Numeric / Numeric / Numeric / Numeric / Numeric / Numeric
I want to calculate the subtotals. For example, I want to calculate the sums at each change in yname, and add subtotal to all numerical variables. There are 160 distinct ynames, so the resulting table should tell me the subtotal of each yname. I haven't sorted the data yet, but this is not a problem because I can sort the data in whatever way I want. Below is a snippet from my data:
date mcode mname ycode yname yissue bsent breturn tsent treturn csales
417572 2010-07-28 45740 ENDPOINT A 5772 XMAG 20100800 7 0 7 0 0
417573 2010-07-31 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417574 2010-08-04 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417575 2010-08-14 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417576 2010-08-26 45740 ENDPOINT A 5772 XMAG 20100800 0 4 0 0 0
417577 2010-07-28 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417578 2010-08-04 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417579 2010-08-26 45741 ENDPOINT L 5772 XMAG 20100800 0 4 0 0 0
417580 2010-07-28 46390 ENDPOINT R 5772 XMAG 20100800 3 0 3 0 1
417581 2010-07-29 46390 ENDPOINT R 5772 XMAG 20100800 0 0 0 0 2
417582 2010-08-01 46390 ENDPOINT R 5779 YMAG 20100800 3 0 3 0 0
417583 2010-08-11 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417584 2010-08-20 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417585 2010-08-24 46390 ENDPOINT R 5779 YMAG 20100800 2 0 2 0 1
417586 2010-08-26 46390 ENDPOINT R 5779 YMAG 20100800 0 2 0 2 0
417587 2010-07-28 46411 ENDPOINT D 5779 YMAG 20100800 6 0 6 0 0
417588 2010-08-08 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417589 2010-08-11 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417590 2010-08-26 46411 ENDPOINT D 5779 YMAG 20100800 0 4 0 4 0
What function should I use here? Maybe something like SQL group by
?
Or the
plyr
library, which is easily extensible to other data classes:There is a R package called sqldf that enables you to use SQL commands on R data.frames. Besides like you already said, GROUP BY would be nice. You can easily store your data in a local MySQL database and connect to R using the package RMySQL (You can use most other DBMS too but MySQL is the easiest to set up).
As far as I can judge it plyr is a great package, too. But from the way you ask and compare your problem to GROUP BY, I guess you know something about SQL, so using this might be easier for you. There are comfortable functions like dbReadTable, plus if your data grows bigger you can select only subparts of your data to only run your analysis with what you really need.
if your data is large and speed matters, i would recommend using the R function rowsum, which is a lot faster. i applied the 3 methods (f1 = aggregate, f2 = ddply, f3 = tapply) suggested in the answers to compare it with f4 = rowsum and here is what i find:
i have added my code below if someone wants to explore in more detail.
You can use
aggregate
For instance, say that you have
Then you can do
OK. Assuming your data are in a data frame named
foo
:Then this will do the aggregation of the numeric columns in your data:
That was using the snippet of data you included in your Q. I used the formula interface to
aggregate()
, which is a bit nicer in this instance because you don't need all thefoo$
bits on the variable names you wish the aggregate. If you have missing data (NA
)in your full data set, then you'll need add an extra argumentna.rm = TRUE
which will get passed tosum()
, like so:You can also use
xtabs
ortapply
: