I have a data.frame from this code:
my_df = data.frame("read_time" = c("2010-02-15", "2010-02-15",
"2010-02-16", "2010-02-16",
"2010-02-16", "2010-02-17"),
"OD" = c(0.1, 0.2, 0.1, 0.2, 0.4, 0.5) )
which produces this:
> my_df
read_time OD
1 2010-02-15 0.1
2 2010-02-15 0.2
3 2010-02-16 0.1
4 2010-02-16 0.2
5 2010-02-16 0.4
6 2010-02-17 0.5
I want to average the OD column over each distinct read_time (notice some are replicated others are not) and I also would like to calculate the standard deviation, producing a table like this:
> my_df
read_time OD stdev
1 2010-02-15 0.15 0.05
5 2010-02-16 0.3 0.1
6 2010-02-17 0.5 0
Which are the best functions to deal with concatenating such values in a data.frame?
The plyr package is popular for this, but the base functions by()
and aggregate()
will also help.
> ddply(my_df, "read_time", function(X) data.frame(OD=mean(X$OD),stdev=sd(X$OD)))
read_time OD stdev
1 2010-02-15 0.15000 0.07071
2 2010-02-16 0.23333 0.15275
3 2010-02-17 0.50000 NA
You can add the missing bit to return 0 instead of NA for the last std.dev.
Also, you don't need the quotes (on the variables) you had in the data.frame construction.
You can try the package data.table. If you know MySQL it should be very easy for you to get all the functions, otherwise the basics are good enough too ;-)
my_dfdt<-data.table(my_df)
mean<-my_dfdt[,mean(OD), by="read_time"]
sd<- ..
you can also join both in one line or to cbind at the end, your call of style
Another advantage: it is extremely fast, if you have large samples. Very fast...see documentation why.
This illustrates how you could use aggregate
to get the mean and standard deviation by your read_time
.
>aggregate(my_df$OD, by=list(my_df$read_time), function(x) mean(x))
Group.1 x
1 2010-02-15 0.1500000
2 2010-02-16 0.2333333
3 2010-02-17 0.5000000
>aggregate(my_df$OD, by=list(my_df$read_time), function(x) sd(x))
Group.1 x
1 2010-02-15 0.07071068
2 2010-02-16 0.15275252
3 2010-02-17 NA