R: What are the best functions to deal with concat


Question:

I have a data.frame from this code:

   my_df = data.frame("read_time" = c("2010-02-15", "2010-02-15", 
                                      "2010-02-16", "2010-02-16", 
                                       "2010-02-16", "2010-02-17"), 
                      "OD" = c(0.1, 0.2, 0.1, 0.2, 0.4, 0.5) )

which produces this:

> my_df
   read_time  OD
1 2010-02-15 0.1
2 2010-02-15 0.2
3 2010-02-16 0.1
4 2010-02-16 0.2
5 2010-02-16 0.4
6 2010-02-17 0.5

I want to average the OD column over each distinct read_time (notice that some dates are replicated and others are not), and I would also like to calculate the standard deviation, producing a table like this:

> my_df
   read_time   OD stdev
1 2010-02-15 0.15  0.05
5 2010-02-16  0.3   0.1
6 2010-02-17  0.5     0

What are the best functions for dealing with such values in a data.frame?

Answer 1:

The plyr package is popular for this, but the base functions by() and aggregate() will also help.

> library(plyr)
> ddply(my_df, "read_time", function(X) data.frame(OD = mean(X$OD), stdev = sd(X$OD)))
   read_time      OD   stdev
1 2010-02-15 0.15000 0.07071
2 2010-02-16 0.23333 0.15275
3 2010-02-17 0.50000      NA

You can add a small tweak to return 0 instead of NA for the last standard deviation (sd() of a single value is NA).
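For example, a minimal sketch of one way to do that, wrapping sd() in a small helper (the name sd0 is mine, not part of plyr):

    sd0 <- function(x) if (length(x) > 1) sd(x) else 0   # sd() of a single value is NA
    ddply(my_df, "read_time",
          function(X) data.frame(OD = mean(X$OD), stdev = sd0(X$OD)))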

Also, you don't need the quotes around the variable names in your data.frame() construction.
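For instance, the same data.frame can be built as:

    my_df = data.frame(read_time = c("2010-02-15", "2010-02-15",
                                     "2010-02-16", "2010-02-16",
                                     "2010-02-16", "2010-02-17"),
                       OD = c(0.1, 0.2, 0.1, 0.2, 0.4, 0.5))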



Answer 2:

You can try the data.table package. If you know MySQL, its syntax should feel familiar; otherwise the basics are easy enough to pick up ;-)

library(data.table)                              # data.table() comes from this package
my_dfdt <- data.table(my_df)
mean_dt <- my_dfdt[, mean(OD), by = "read_time"] # per-group mean
sd_dt   <- my_dfdt[, sd(OD),   by = "read_time"] # per-group standard deviation

You can also compute both in one call, or cbind() the two results at the end; it's a matter of style.
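For instance, a minimal sketch of doing both in a single call with data.table's .() (list) syntax; the output column names are my own choice:

    my_dfdt[, .(OD = mean(OD), stdev = sd(OD)), by = read_time]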

Another advantage: it is extremely fast if you have large samples; see the documentation for why.



Answer 3:

This illustrates how you can use aggregate() to get the mean and standard deviation by read_time.

> aggregate(my_df$OD, by = list(my_df$read_time), FUN = mean)

     Group.1         x
1 2010-02-15 0.1500000
2 2010-02-16 0.2333333
3 2010-02-17 0.5000000


> aggregate(my_df$OD, by = list(my_df$read_time), FUN = sd)
     Group.1          x
1 2010-02-15 0.07071068
2 2010-02-16 0.15275252
3 2010-02-17         NA
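
If you want both statistics from a single call, here is a minimal sketch using aggregate()'s formula interface (my own variation, not from the original answer):

    aggregate(OD ~ read_time, data = my_df,
              FUN = function(x) c(mean = mean(x), stdev = sd(x)))

Note that the resulting OD column is then a matrix with mean and stdev columns rather than two separate numeric columns.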