Summary statistics by two or more factor variables

2019-01-31 14:11发布

问题:

This is best illustrated with an example

str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("manual","auto")
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)

That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, and 9 cells with the bivariate sums in, how would I get that 'smartly'.

I could go.

tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)

This seems cumbersome.

Then how would I bring a 3rd variable in the mix?

This is somewhat in the space I'm thinking about. Summary statistics using ddply

update This gets me there, but it's not pretty.

aggregate(mpg ~ am+cyl+gear, mtcars,sum)

Cheers

回答1:

How about this, still using tapply()? It's more versatile than you knew!

with(mtcars, tapply(mpg, list(cyl, gear), sum))
#       three  four five
# four   21.5 215.4 56.4
# six    39.5  79.0 19.7
# eight 180.6    NA 30.8

Or, if you'd like the printed output to be a bit more interpretable:

with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))

If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:

A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))

dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices

# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE  TRUE 
#    12    42 


回答2:

I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr package (this came up for me because I'm teaching a class right now where we use dplyr for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply or aggregate).

You can group on as many variables as you want using the group_by function and then summarize information from these groups with summarize. I think this code is more readable to an R newcomer than the formula-based interface of aggregate, yielding identical results:

library(dplyr)
mtcars %>%
  group_by(am, cyl, gear) %>%
  summarize(mpg=sum(mpg))
#       am   cyl  gear   mpg
#    (dbl) (dbl) (dbl) (dbl)
# 1      0     4     3  21.5
# 2      0     4     4  47.2
# 3      0     6     3  39.5
# 4      0     6     4  37.0
# 5      0     8     3 180.6
# 6      1     4     4 168.2
# 7      1     4     5  56.4
# 8      1     6     4  42.0
# 9      1     6     5  19.7
# 10     1     8     5  30.8

With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread function from the tidyr package:

library(dplyr)
library(tidyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mpg=sum(mpg)) %>%
  spread(gear, mpg)
#     cyl     3     4     5
#   (dbl) (dbl) (dbl) (dbl)
# 1     4  21.5 215.4  56.4
# 2     6  39.5  79.0  19.7
# 3     8 180.6    NA  30.8


回答3:

I like Josh's answer for this, but reshape2 can also provide a nice framework for these type of problems:

library(reshape2)

#use subset to only grab the variables of interest...
mtcars.m <- melt(subset(mtcars, select = c("mpg", "gear", "cyl")), measure.vars="mpg")
#cast into appropriate format
dcast(mtcars.m, cyl ~ gear, fun.aggregate=sum, value.var="value")

   cyl three  four five
1  four  21.5 215.4 56.4
2   six  39.5  79.0 19.7
3 eight 180.6   0.0 30.8


回答4:

The answer contains same output using tapply and aggregate function.

I would like to add some information to Josh O'Brien's answer. User can either use aggregate function or tapply depending on output. In order to use more than one factor variable in tapply one can use the method Josh has shown.

Loading dataset

data("mtcars")

Using tapply

with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))

The output of above code is

        Gear#
Cylinder#     3     4    5
    4     21.5 215.4 56.4
    6     39.5  79.0 19.7
    8    180.6    NA 30.8

Using aggregate function

with(mtcars, aggregate(mpg, list(Cylinder = cyl, Gear =  gear), sum))

Output of aggregate function

    Cylinder Gear  x
1        4    3  21.5
2        6    3  39.5
3        8    3 180.6
4        4    4 215.4
5        6    4  79.0
6        4    5  56.4
7        6    5  19.7
8        8    5  30.8

Now if the user wants same output as aggregate function but using tapply.

as.data.frame(as.table(with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear),
sum))))

Output of tapply function

   Cylinder. Gear.  Freq
1         4     3  21.5
2         6     3  39.5
3         8     3 180.6
4         4     4 215.4
5         6     4  79.0
6         8     4    NA
7         4     5  56.4
8         6     5  19.7
9         8     5  30.8

NA's can be kept or removed as per business requirements.



标签: r summary