Aggregate sum and mean in R with ddply

2019-06-22 04:53发布

问题:

My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.

I know I could do it like this:

ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))

But while it is reasonable for 3 columns, it is unacceptable for 17 of them.

When researching this, I found the colwise function, but the best I came up with is this:

cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])

Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):

ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))

or:

ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))

To sum up:

  • I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
  • Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.

For reference, the result I expect is 5 rows and 5 columns:

        cut         x         y        z    price
1      Fair  10057.50   9954.07  6412.26 4358.758
2      Good  28645.08  28703.75 17855.42 3928.864
3 Very Good  69359.09  69713.45 43009.52 3981.760
4   Premium  82385.88  81985.82 50297.49 4584.258
5     Ideal 118691.07 118963.24 73304.61 3457.542

回答1:

Antoher solution using dplyr. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination.

library(dplyr)
library(ggplot2)

diamonds %>%
    group_by(cut) %>%
    summarise_each(funs(sum, mean), x:z, price) %>%
    select(cut, matches("[xyz]_sum"), price_mean)


回答2:

I would like to suggest data.table solutions for this. You can easily predefine the columns you want operate either by position or by names and then reuse the same code no matter how many column you want to operate on.

Predifine column names

Sums <- 7:9
Means <- "price"

Run the code

library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
                         lapply(.SD[, Means, with = FALSE], mean))
                     , by = cut]

#          cut         x         y        z    price
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758

For your specific example, this could simplified to just

data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
#          cut         x         y        z       pe
# 1:     Ideal 118691.07 118963.24 73304.61 3457.542
# 2:   Premium  82385.88  81985.82 50297.49 4584.258
# 3:      Good  28645.08  28703.75 17855.42 3928.864
# 4: Very Good  69359.09  69713.45 43009.52 3981.760
# 5:      Fair  10057.50   9954.07  6412.26 4358.758


回答3:

Yet another approach (in my opinion easier to read) for your particular case (mean = sum/n!)

nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1

or the more generic,

do.call(merge, 
    lapply(c(colwise(sum, 7:9), colwise(mean, 6)), 
           function(cw) ddply(diamonds, ~cut, cw)))


回答4:

Just to throw in another solution:

library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)

makeList <- function(inL, mdat = diamonds, by = ~cut) {
   colN <- names(mdat)
   args <- unlist(llply(names(inL), function(n) {
      llply(inL[[n]], function(x) {
         ret <- list(call(n, as.symbol(colN[[x]])))
         names(ret) <- paste(n, colN[[x]], sep = ".")
         ret
      })
   }))
   args$.data <- as.symbol(deparse(substitute(mdat)))
   args$.variables <- by
   args$.fun <- as.symbol("summarise")
   args
}

do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price
# 1      Fair 6.246894 6.182652 3.982770   7017600
# 2      Good 5.838785 5.850744 3.639507  19275009
# 3 Very Good 5.740696 5.770026 3.559801  48107623
# 4   Premium 5.973887 5.944879 3.647124  63221498
# 5     Ideal 5.507451 5.520080 3.401448  74513487

The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:

trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
#         cut   mean.x   mean.y   mean.z sum.price      sd.y      sd.z
# 1      Fair 6.246894 6.182652 3.982770   7017600 0.9563804 0.6516384
# 2      Good 5.838785 5.850744 3.639507  19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801  48107623 1.1029236 0.7302281
# 4   Premium 5.973887 5.944879 3.647124  63221498 1.2597511 0.7311610
# 5     Ideal 5.507451 5.520080 3.401448  74513487 1.0744953 0.6576481


回答5:

It uses dplyr, but I believe this will accomplish the specified aim completely in reasonably easy to read syntax:

diamonds %>%
  group_by(cut) %>%
  select(x:z) %>%
  summarize_each(funs(sum)) %>%
  merge(diamonds %>%
          group_by(cut) %>%
          summarize(price = mean(price))
        ,by = "cut")

The only "trick" is that there is a piped expression inside of the merge that handles the calculation of the mean price separately from the calculation of sums.

I benchmarked this solution against the solution provided by @David Arenburg (using data.table) and @thothal (using plyr as requested by the question) with 5000 replications. Here data.table came out slower than plyr and dplyr. dplyr was faster than plyr. One imagines that the benchmark results could change as a function of the number of columns, number of levels in the grouping factor, and particular functions applied. For example, MarkusN submitted an answer after I did my initial benchmarks that is substantially faster than the previously submitted answers for the sample data. He accomplishes this by calculating many summary statistics that aren't desired and then throwing them away... surely there must be a point at which the costs of that approach outweigh the advantages.

       test replications elapsed relative user.self sys.self user.child sys.child
2 dataTable         5000 119.686    2.008   119.611    0.127          0         0
1     dplyr         5000  59.614    1.000    59.676    0.004          0         0
3      plyr         5000  68.505    1.149    68.493    0.064          0         0
?      MarkusN      5000  23.172    ?????    23.926        0          0         0

Certainly speed is not the only consideration. In particular, dplyr and plyr are picky about the order in which they are loaded (plyr before dplyr) and have several functions that mask each other.



回答6:

Not 100% what you are looking for but it might give you another idea on how to do it. Using data.table you can do something like this:

diamonds2[, .(c = sum(c), p = sum(p), ce = sum(ce), pe = mean(pe)), by = cut]

To shorten the code (what you tried to do with colwise), you probably have to write some functions to achieve exactly what you want.



回答7:

For completeness, here's a solution based on dplyr and answers posted by Veerendra Gadekar in another question and here by MarkusN.

In this particular case, it's possible to first apply sum to some of the columns and then mean to all columns of interest:

diamonds %>%
  group_by(cut) %>%
  mutate_each('sum', 8:10) %>%
  summarise_each('mean', 8:10, price)

This is possible, because mean won't change the calculated sums of columns 8:10 and will calculate the required mean of prices. But if we wanted to calculate standard deviation of prices instead of mean, this approach wouldn't work as columns 8:10 would all be 0.

A more general approach could be:

diamonds %>%
   group_by(cut) %>%
   mutate_each('sum', 8:10) %>%
   mutate_each('mean', price) %>%
   summarise_each('first', 8:10, price)

One may not be pleased by summarise_each repeating column specifications that were named earlier, but this seems like an elegant solution nonetheless.

It has the advantage over MarkusN's solution that it doesn't require matching newly created columns and doesn't change their names.

Solution by Veerendra Gadekar should end with select(cut, 8:10, price) %>% arrange(cut) in order to produce expected results (subset of columns, plus rows sorted by grouping key). Suggestion of Hong Ooi is similar to the first one here, but assumes there are no other columns.

Finally, it seems to be more legible and easy to understand than a data.table solution, like the one proposed by David Arenburg.



标签: r plyr