Aggregating duplicate rows by taking sum


Question:

Following on from my questions:
1. Identifying whether a set of variables uniquely identifies each row of the data or not;
2. Tagging all rows that are duplicates in terms of a given set of variables,
I would now like to aggregate/consolidate all the duplicate rows in terms of a given set of variables, by taking their sum.
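For concreteness, here is a toy example (base R aggregate, illustrative data only) of the operation I mean; the real question is about doing this quickly on much larger data:

# toy example: collapse duplicate (f1, f2) rows by summing 'data'
toy = data.frame(f1 = c(1, 1, 2), f2 = c("a", "a", "b"), data = c(10, 5, 3))
aggregate(data ~ f1 + f2, data = toy, FUN = sum)
#   f1 f2 data
# 1  1  a   15
# 2  2  b    3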

Solution 1:

There is some guidance on how to do this here, but when the variables that form the index have a large number of levels, the ddply method recommended there is slow, just as it was when I was trying to tag all the duplicate rows in terms of a given set of variables.

library(plyr)

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))

# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])

# dummy data 
dfDup$data = rnorm(nrow(dfDup))

# aggregate the duplicate rows by taking the sum
dfDupAgg = ddply(dfDup, .(f1, f2, f3, f4), summarise, data = sum(data))
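As a quick sanity check: since dfDup is just dfUnique plus 100 extra sampled rows, the aggregated result should collapse back to exactly one row per unique (f1, f2, f3, f4) combination:

# aggregation should leave one row per unique key
nrow(dfDupAgg) == nrow(dfUnique)  # expected TRUE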

Solution 2:

The second solution is to use data.table and, following the advice here, I could do:

# data.table solution
library(data.table)
indexVars = paste0('f', 1:4)
dtDup = data.table(dfDup, key = indexVars)
dtDupAgg = dtDup[, list(data = sum(data)), by = key(dtDup)]

I have a couple of questions:
1. Is there a way to make the ddply version faster?
2. Is the data.table solution correct? I want to check, since I am new to data.table.

Answer 1:

Regarding your data.table solution, you don't need to set a key for aggregation operations. You can directly do:

indexVars = paste0('f', 1:4)
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]
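As an aside, if you also want the result keyed (and therefore sorted) by the grouping columns, keyby does the grouping and the ordering in one step; a minor variant of the same call:

dtDupAgg = dtDup[, list(data = sum(data)), keyby = c(indexVars)]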

data.table version 1.9.2+ also implements a function setDT that converts a data.frame to a data.table by reference (that is, without making a copy), so the conversion takes almost no time, which is especially useful on large data.frames.

So, instead of doing:

dtDup <- as.data.table(dfDup)
dtDup[...]

You could do:

## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...]   ## dfDup is now a data.table, converted by reference
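Putting that together, a minimal end-to-end sketch (assuming dfDup and indexVars as built in the question):

library(data.table)   ## v1.9.2+
setDT(dfDup)          ## convert to data.table by reference (no copy)
dtDupAgg = dfDup[, list(data = sum(data)), by = c(indexVars)]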

On your first question, plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info.

Perhaps you may be interested in dplyr, which is orders of magnitude faster than plyr, but still slower than data.table, IMHO. Here's the equivalent dplyr version:

dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))
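(Note for readers on current dplyr versions: the %.% chaining operator used here was later deprecated in favour of %>%, so the equivalent pipeline today would be:)

library(dplyr)
dfDup %>% group_by(f1, f2, f3, f4) %>% summarise(data = sum(data))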

Here's a benchmark between data.table and dplyr on the data (all timings are minimum of three consecutive runs):

## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
#  user  system elapsed 
# 0.049   0.009   0.057 

## dplyr (commit ~1360 from github)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
#  user  system elapsed 
# 0.374   0.013   0.389 

I really don't have the patience to run the plyr version to completion (I stopped it after 93 seconds on the first run). As you can see, dplyr is much faster than plyr, but still ~7x slower than data.table here.


Check if the results are equal to be sure:

all.equal(as.data.frame(ans1[order(f1,f2,f3,f4)]), 
          as.data.frame(ans2))
# [1] TRUE

HTH