How to sum a variable by group?

Posted 2018-12-30 22:23

Let's say I have two columns of data. The first contains categories such as "First", "Second", and "Third". The second has numbers that represent the number of times I observed each category.

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

Tags: r sorting r-faq
13 answers
素衣白纱
Answer 2 · 2018-12-30 22:49

I find ave very helpful (and efficient) when you need to apply different aggregation functions to different columns (and you must/want to stick to base R):

For example, given this input:

DF <- data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
                 Categ2=factor(c('X','Y','X','X','X','Y','Y')),
                 Samples=c(1,2,4,3,5,6,7),
                 Freq=c(10,30,45,55,80,65,50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50

We want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here's a possible solution using ave:

# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2 
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2 
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]

Result:

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65
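
For contrast, base R's aggregate can do this grouping in one call, but it applies the same function to every value column, which is exactly why ave helps when the functions differ:

# one function for all value columns; mixing sum and mean would take
# two aggregate() calls plus a merge()
aggregate(cbind(Samples, Freq) ~ Categ1 + Categ2, data = DF, FUN = sum)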
千与千寻千般痛.
Answer 3 · 2018-12-30 22:50

More recently, you can also use the dplyr package for that purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise_each(funs(sum))

Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at, and summarise_if family of functions.
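
With that API, the call above can be written as:

x %>% 
  group_by(Category) %>% 
  summarise_all(sum)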

Or, if you have multiple columns to group by, you can specify all of them in the group_by separated with commas:

mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including the %>% operator, see the introduction to dplyr.

查无此人
Answer 4 · 2018-12-30 22:50

While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34
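
Since the question also asks to sort by Category, adding an ORDER BY clause makes that explicit (the output order of GROUP BY alone is not guaranteed in general):

sqldf("select Category, sum(Frequency) as Frequency
       from x
       group by Category
       order by Category")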
闭嘴吧你
Answer 5 · 2018-12-30 22:52

This is somewhat related to this question.

You can also just use the by() function:

# by() returns a one-dimensional "by" object, one element per Category
x2 <- by(x$Frequency, x$Category, sum)
# bind the per-group sums into a matrix
do.call(rbind, as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
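
For comparison, tapply() (also base R) returns the same sums as a named vector:

tapply(x$Frequency, x$Category, sum)
#  First Second  Third 
#     30      5     34 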

骚的不知所云
Answer 6 · 2018-12-30 22:53

The recently added dplyr::tally() now makes this easier than ever:

x %>% 
  group_by(Category) %>% 
  tally(wt = Frequency)

Category     n
First        30
Second       5
Third        34
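
If your dplyr version also provides count(), the grouping and tallying collapse into a single call:

# count() is shorthand for group_by() + tally()
count(x, Category, wt = Frequency)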
牵手、夕阳
Answer 7 · 2018-12-30 22:56

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that to the same operation using a data.frame and aggregate:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column name, this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 
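
A single system.time() run is noisy; a sketch with the microbenchmark package (assuming it is installed; dt and df are hypothetical names standing in for the two objects built above) gives steadier numbers:

library(microbenchmark)
library(data.table)
# same data as above, under hypothetical names dt and df
dt <- data.table(Category = rep(c("First", "Second", "Third"), 100000),
                 Frequency = rnorm(300000))
df <- as.data.frame(dt)
microbenchmark(
  data.table = dt[, sum(Frequency), by = Category],
  aggregate  = aggregate(df$Frequency, by = list(Category = df$Category), FUN = sum),
  times = 10
)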

For multiple aggregations, you can combine lapply and .SD as follows:

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
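
You can also name the result columns and mix several functions directly in j:

# named summary columns, multiple functions in one j expression
data[, .(Total = sum(Frequency), Mean = mean(Frequency)), by = Category]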