可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
This is best illustrated with an example
str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("manual","auto")
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)
That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, and 9 cells with the bivariate sums in, how would I get that 'smartly'.
I could go.
tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)
This seems cumbersome.
Then how would I bring a 3rd variable in the mix?
This is somewhat in the space I'm thinking about.
Summary statistics using ddply
update This gets me there, but it's not pretty.
aggregate(mpg ~ am+cyl+gear, mtcars,sum)
Cheers
回答1:
How about this, still using tapply()
? It's more versatile than you knew!
with(mtcars, tapply(mpg, list(cyl, gear), sum))
# three four five
# four 21.5 215.4 56.4
# six 39.5 79.0 19.7
# eight 180.6 NA 30.8
Or, if you'd like the printed output to be a bit more interpretable:
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:
A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))
dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices
# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE TRUE
# 12 42
回答2:
I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr
package (this came up for me because I'm teaching a class right now where we use dplyr
for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply
or aggregate
).
You can group on as many variables as you want using the group_by
function and then summarize information from these groups with summarize
. I think this code is more readable to an R newcomer than the formula-based interface of aggregate
, yielding identical results:
library(dplyr)
mtcars %>%
group_by(am, cyl, gear) %>%
summarize(mpg=sum(mpg))
# am cyl gear mpg
# (dbl) (dbl) (dbl) (dbl)
# 1 0 4 3 21.5
# 2 0 4 4 47.2
# 3 0 6 3 39.5
# 4 0 6 4 37.0
# 5 0 8 3 180.6
# 6 1 4 4 168.2
# 7 1 4 5 56.4
# 8 1 6 4 42.0
# 9 1 6 5 19.7
# 10 1 8 5 30.8
With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread
function from the tidyr
package:
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
summarize(mpg=sum(mpg)) %>%
spread(gear, mpg)
# cyl 3 4 5
# (dbl) (dbl) (dbl) (dbl)
# 1 4 21.5 215.4 56.4
# 2 6 39.5 79.0 19.7
# 3 8 180.6 NA 30.8
回答3:
I like Josh's answer for this, but reshape2
can also provide a nice framework for these type of problems:
library(reshape2)
#use subset to only grab the variables of interest...
mtcars.m <- melt(subset(mtcars, select = c("mpg", "gear", "cyl")), measure.vars="mpg")
#cast into appropriate format
dcast(mtcars.m, cyl ~ gear, fun.aggregate=sum, value.var="value")
cyl three four five
1 four 21.5 215.4 56.4
2 six 39.5 79.0 19.7
3 eight 180.6 0.0 30.8
回答4:
The answer contains same output using tapply and aggregate function.
I would like to add some information to Josh O'Brien's answer. User can either use aggregate function or tapply depending on output. In order to use more than one factor variable in tapply one can use the method Josh has shown.
Loading dataset
data("mtcars")
Using tapply
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
The output of above code is
Gear#
Cylinder# 3 4 5
4 21.5 215.4 56.4
6 39.5 79.0 19.7
8 180.6 NA 30.8
Using aggregate function
with(mtcars, aggregate(mpg, list(Cylinder = cyl, Gear = gear), sum))
Output of aggregate function
Cylinder Gear x
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 4 5 56.4
7 6 5 19.7
8 8 5 30.8
Now if the user wants same output as aggregate function but using tapply.
as.data.frame(as.table(with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear),
sum))))
Output of tapply function
Cylinder. Gear. Freq
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 8 4 NA
7 4 5 56.4
8 6 5 19.7
9 8 5 30.8
NA's can be kept or removed as per business requirements.