Question:
My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.
I know I could do it like this:
ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))
But while it is reasonable for 3 columns, it is unacceptable for 17 of them.
When researching this, I found the colwise function, but the best I came up with is this:
cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])
Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):
ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))
or:
ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))
To sum up:
- I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
- Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.
For reference, the result I expect is 5 rows and 5 columns:
cut x y z price
1 Fair 10057.50 9954.07 6412.26 4358.758
2 Good 28645.08 28703.75 17855.42 3928.864
3 Very Good 69359.09 69713.45 43009.52 3981.760
4 Premium 82385.88 81985.82 50297.49 4584.258
5 Ideal 118691.07 118963.24 73304.61 3457.542
Answer 1:
Another solution using dplyr. First apply both aggregate functions to every variable you want aggregated; from the resulting variables, select only the desired function/variable combinations.
library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  summarise_each(funs(sum, mean), x:z, price) %>%
  select(cut, matches("[xyz]_sum"), price_mean)
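For reference, summarise_each() and funs() were later deprecated in dplyr. Assuming dplyr 1.0 or newer, the same result can be sketched with across(), which also avoids computing the unwanted function/variable combinations:

library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  summarise(across(x:z, sum),      # sum each of x, y, z per group
            price = mean(price))   # average price per group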
Answer 2:
I would like to suggest a data.table solution for this. You can easily predefine the columns you want to operate on, either by position or by name, and then reuse the same code no matter how many columns are involved.
Predefine the column names:
Sums <- 7:9
Means <- "price"
Run the code
library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
                         lapply(.SD[, Means, with = FALSE], mean)),
                     by = cut]
# cut x y z price
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
For your specific example, this could be simplified to just:
data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
# cut x y z pe
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
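In newer data.table versions, the same pattern is commonly written with .SDcols, which keeps the column selection out of j; a sketch, assuming selection by column name rather than position:

library(data.table)
library(ggplot2)

Sums <- c("x", "y", "z")
as.data.table(diamonds)[, c(lapply(.SD, sum),    # .SD holds only the Sums columns here
                            pe = mean(price)),   # other columns remain visible in j
                        by = cut, .SDcols = Sums]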
Answer 3:
Yet another approach (in my opinion easier to read), exploiting the fact that in your particular case mean = sum/n:
nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1
Or, more generically:
do.call(merge,
        lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
               function(cw) ddply(diamonds, ~cut, cw)))
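Note that do.call(merge, ...) works here only because the lapply() produces exactly two data frames; merge() merges two at a time. If more function/column groups were added, the list could instead be folded with Reduce(), a sketch reusing the same pieces:

Reduce(merge,
       lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
              function(cw) ddply(diamonds, ~cut, cw)))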
Answer 4:
Just to throw in another solution:
library(plyr)
library(ggplot2)

trans <- list(mean = 8:10, sum = 7)

makeList <- function(inL, mdat = diamonds, by = ~cut) {
  colN <- names(mdat)
  # Build one named summarise() term per function/column pair,
  # e.g. mean.x = mean(x)
  args <- unlist(llply(names(inL), function(n) {
    llply(inL[[n]], function(x) {
      ret <- list(call(n, as.symbol(colN[[x]])))
      names(ret) <- paste(n, colN[[x]], sep = ".")
      ret
    })
  }))
  # Complete the argument list that do.call() will hand to ddply()
  args$.data <- as.symbol(deparse(substitute(mdat)))
  args$.variables <- by
  args$.fun <- as.symbol("summarise")
  args
}

do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price
# 1 Fair 6.246894 6.182652 3.982770 7017600
# 2 Good 5.838785 5.850744 3.639507 19275009
# 3 Very Good 5.740696 5.770026 3.559801 48107623
# 4 Premium 5.973887 5.944879 3.647124 63221498
# 5 Ideal 5.507451 5.520080 3.401448 74513487
The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:
trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price sd.y sd.z
# 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384
# 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281
# 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610
# 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481
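To make the metaprogramming concrete: with the original trans <- list(mean = 8:10, sum = 7), and assuming the column order of diamonds puts price at position 7 and x, y, z at positions 8:10, do.call(ddply, makeList(trans)) expands to the same call one would otherwise write by hand:

ddply(diamonds, ~cut, summarise,
      mean.x = mean(x), mean.y = mean(y), mean.z = mean(z),
      sum.price = sum(price))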
Answer 5:
This uses dplyr, but I believe it accomplishes the specified aim completely, in reasonably easy-to-read syntax:
diamonds %>%
  group_by(cut) %>%
  select(x:z) %>%
  summarize_each(funs(sum)) %>%
  merge(diamonds %>%
          group_by(cut) %>%
          summarize(price = mean(price)),
        by = "cut")
The only "trick" is that there is a piped expression inside of the merge that handles the calculation of the mean price separately from the calculation of sums.
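As a minor variant not in the original answer, the base merge() could be swapped for dplyr's own inner_join(), keeping the whole pipeline within one package:

diamonds %>%
  group_by(cut) %>%
  select(x:z) %>%
  summarize_each(funs(sum)) %>%
  inner_join(diamonds %>%
               group_by(cut) %>%
               summarize(price = mean(price)),
             by = "cut")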
I benchmarked this solution against the solution provided by @David Arenburg (using data.table) and @thothal (using plyr, as requested by the question) with 5000 replications. Here data.table came out slower than both plyr and dplyr, and dplyr was faster than plyr. One imagines that the benchmark results could change as a function of the number of columns, the number of levels in the grouping factor, and the particular functions applied. For example, MarkusN submitted an answer after I did my initial benchmarks that is substantially faster than the previously submitted answers for the sample data. He accomplishes this by calculating many summary statistics that aren't desired and then throwing them away; surely there must be a point at which the costs of that approach outweigh the advantages.
test replications elapsed relative user.self sys.self user.child sys.child
2 dataTable 5000 119.686 2.008 119.611 0.127 0 0
1 dplyr 5000 59.614 1.000 59.676 0.004 0 0
3 plyr 5000 68.505 1.149 68.493 0.064 0 0
? MarkusN 5000 23.172 ????? 23.926 0 0 0
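The columns above match what rbenchmark::benchmark() prints; a sketch of how such a comparison could be set up (a reconstruction under that assumption, not the author's exact harness):

library(rbenchmark)
library(dplyr)
library(data.table)
library(ggplot2)

DT <- as.data.table(diamonds)

benchmark(
  dplyr     = diamonds %>%
                group_by(cut) %>%
                summarise_each(funs(sum, mean), x:z, price) %>%
                select(cut, matches("[xyz]_sum"), price_mean),
  dataTable = DT[, c(lapply(.SD, sum), pe = mean(price)),
                 by = cut, .SDcols = c("x", "y", "z")],
  replications = 5000
)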
Certainly speed is not the only consideration. In particular, dplyr and plyr are picky about the order in which they are loaded (plyr before dplyr) and have several functions that mask each other.
Answer 6:
Not 100% what you are looking for, but it might give you another idea of how to do it. Using data.table (here diamonds2 presumably stands in for a data.table version of your data, with columns c, p, ce, and pe) you can do something like this:
diamonds2[, .(c = sum(c), p = sum(p), ce = sum(ce), pe = mean(pe)), by = cut]
To shorten the code (which is what you tried to do with colwise), you would probably have to write some helper functions to achieve exactly what you want.
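Such a helper might look like this hypothetical sketch (group_agg and its argument names are illustrative, not part of data.table):

library(data.table)
library(ggplot2)

# Hypothetical helper: within each group, sum one set of columns
# and average another.
group_agg <- function(dt, by_col, sum_cols, mean_cols) {
  dt[, c(lapply(.SD[, sum_cols, with = FALSE], sum),
         lapply(.SD[, mean_cols, with = FALSE], mean)),
     by = by_col]
}

group_agg(as.data.table(diamonds), "cut",
          sum_cols = c("x", "y", "z"), mean_cols = "price")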
Answer 7:
For completeness, here's a solution based on dplyr and the answers posted by Veerendra Gadekar in another question and here by MarkusN. In this particular case, it's possible to first apply sum to some of the columns and then mean to all columns of interest:
diamonds %>%
  group_by(cut) %>%
  mutate_each('sum', 8:10) %>%
  summarise_each('mean', 8:10, price)
This is possible because mean won't change the already-calculated sums of columns 8:10 (they are constant within each group) and will calculate the required mean of prices. But if we wanted to calculate the standard deviation of prices instead of the mean, this approach wouldn't work, as columns 8:10 would all become 0.
A more general approach could be:
diamonds %>%
  group_by(cut) %>%
  mutate_each('sum', 8:10) %>%
  mutate_each('mean', price) %>%
  summarise_each('first', 8:10, price)
One may not be pleased by summarise_each repeating column specifications that were named earlier, but this seems like an elegant solution nonetheless.
It has the advantage over MarkusN's solution that it doesn't require matching newly created columns and doesn't change their names.
The solution by Veerendra Gadekar should end with select(cut, 8:10, price) %>% arrange(cut) in order to produce the expected results (a subset of columns, plus rows sorted by the grouping key). The suggestion by Hong Ooi is similar to the first approach here, but assumes there are no other columns.
Finally, it seems more legible and easier to understand than a data.table solution, like the one proposed by David Arenburg.