Is there a way to speed up the combn
command to get all unique combinations of 2 elements taken from a vector?
Usually this would be set up like this:
# Get latest version of data.table
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
# Transform data
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
However, combn
is 10 times slower (23sec versus 3 sec on my computer) than calculating all possible combinations using data.table.
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
Dealing with very large vectors, I am searching for a way to save memory by only calculating the unique combinations (like combn
), but with the speed of data.table (see second code snippet).
I appreciate any help.
Here are two base-R solutions if you don't want to use additional dependencies:
comb2.int
usesrep
and other sequence generating functions to generate the desired output.comb2.mat
creates a matrix, usesupper.tri()
to get the upper triangle andwhich(..., arr.ind = TRUE)
to obtain the column and row indices => all combinations.Possibility 1:
comb2.int
Possibility 2:
comb2.mat
The functions give the same result as
combn(.)
:But I have other elements in my vector than sequencial integers!
Use the return values as indices:
Benchmark:
time(
combn
) = ~5x time(comb2.mat
) = ~80x time(comb2.int
):Here's a way using
data.table
functionfoverlaps()
, that also turns out to be fast!Note that
foverlaps()
does not calculate all permutations. The subsetxid != yid
is needed to remove self overlaps. The subset could be internally handled more efficiently by implementingignoreSelf
argument - similar toIRanges::findOverlaps
.Now it's just a matter of performing a subset using the ids obtained:
So totally, ~1.4 seconds.
The advantage is that you can do the same way even if your data.table
d
has more than 1 column on which you've to get the combinations for, and using the same amount of memory (since we return the indices). In that case, you'd just do:But it's limited to replacing just
combn(., 2L)
. Not more than 2L.You could use
combnPrim
fromgRbase
Here is a solution using Rcpp.
Using the Rcpp function to get the indices and then form the data.table, works better.
A post with any variation of the word Fast in the title is incomplete without benchmarks. Before we post any benchmarks, I would just like to mention that since this question was posted, two highly optimized packages,
arrangements
andRcppAlgos
(I am the author) for generating combinations have been released forR
.To give you an idea of their speed over
combn
andgRbase::combnPrim
, here is a basic benchmark:Now, we benchmark the other functions posted for the very specific case of producing combinations choose 2 and producing a
data.table
object.The functions are as follows:
And here are the benchmarks on the example given by the OP:
We see that the function provided by @AnirbanMukherjee is the fastest for this task, followed by
RcppAlgos
/arrangements
(very close timings).They all give the same result as well:
Thanks to @Frank for pointing out how to compare two
data.tables
without going through the pains of creating newdata.tables
and then arranging them: