Suppose I have a vector c(1, 2, 3, 4)
with no duplicated values. I need a vector c(1 * 2, 1 * 3, 1 * 4, 2 * 3, 2 * 4, 3 * 4)
, so the multiplication is done in all possible combinations of this vector's values. Is there a way of doing that? Thanks in advance!
问题:
回答1:
We can use combn
with anonymous function call
combn(vec, 2, FUN = function(x) x[1] * x[2])
#[1] 2 3 4 6 8 12
data
vec <- 1:4
回答2:
This is fun enough. I thought combn(1:4, 2, "*")
would be the simplest solution but it actually does not work. We have to use combn(1:4, 2, prod)
as Onyambu commented. The issue is: in an "N choose K" setting, FUN
must be able to take a vector of length-K as input. "*"
is not the right one.
## K = 2 case
"*"(c(1, 2)) ## this is different from: "*"(1, 2)
#Error in *c(1, 2) : invalid unary operator
prod(c(1, 2))
#[1] 2
We are going too far, but we will meet this sooner or later
Thanks Maurits Evers for elaboration on outer
/ lower.tri
/ upper.tri
. Here is an adapted way to avoid generating those temporary matrices from outer
and *****.tri
:
tri_ind <- function (n, lower= TRUE, diag = FALSE) {
if (diag) {
tmp <- n:1
j <- rep.int(1:n, tmp)
i <- sequence(tmp) - 1L + j
} else {
tmp <- (n-1):1
j <- rep.int(1:(n-1), tmp)
i <- sequence(tmp) + j
}
if (lower) list(i = i, j = j)
else list(i = j, j = i)
}
vec <- 1:4
ind <- tri_ind(length(vec), FALSE, FALSE)
#$i
#[1] 1 1 1 2 2 3
#
#$j
#[1] 2 3 4 3 4 4
vec[ind[[1]]] * vec[ind[[2]]]
#[1] 2 3 4 6 8 12
The tri_ind
function is wrapper of my this answer. It could be used as a fast and memory-efficient alternative for combn(length(vec), 2)
or its outer
-equivalence.
Originally I linked an finv
function but it is not good for benchmarking, as it is designed for extracting a few elements from a "dist" object (a collapsed lower triangular matrix). If all elements of the triangular matrix are referenced, its index computation actually imposes unnecessary overhead. tri_ind
is a better option.
library(bench)
benchmark index generation
bench1 <- function (n) {
bench::mark("combn" = combn(n, 2),
"tri_ind" = tri_ind(n, TRUE, FALSE),
"upper.tri" = which(upper.tri(matrix(0, n, n)), arr.ind = TRUE),
check = FALSE)
}
## for small problem, `tri_ind` is already the fastest
bench1(100)
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:tm> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#1 combn 11.6ms 11.9ms 11.9ms 12.59ms 83.7 39.1KB 9 32
#2 tri_ind 189.3µs 205.9µs 194.6µs 4.82ms 4856. 60.4KB 21 1888
#3 upper.tri 618.4µs 635.8µs 624.1µs 968.36µs 1573. 411.7KB 57 584
## `tri_ind` is 10x faster than `upper.tri`, and 100x faster than `combn`
bench1(5000)
# expression min mean median max `itr/sec` mem_alloc n_gc
# <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#1 combn 30.6s 30.6s 30.6s 30.6s 0.0327 95.4MB 242
#2 tri_ind 231.98ms 259.31ms 259.31ms 286.63ms 3.86 143.3MB 0
#3 upper.tri 3.02s 3.02s 3.02s 3.02s 0.332 953.6MB 4
benchmark on OP's problem
bench2 <- function (n) {
vec <- numeric(n)
bench::mark("combn" = combn(vec, 2, prod),
"tri_ind" = {ind <- tri_ind(n, FALSE, FALSE);
vec[ind[[1]]] * vec[ind[[2]]]},
"upper.tri" = {m <- outer(vec, vec);
c(m[upper.tri(m)])},
check = FALSE)
}
bench2(100)
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:tm> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#1 combn 18.6ms 19.2ms 19.1ms 20.55ms 52.2 38.7KB 4 22
#2 tri_ind 386.9µs 432.3µs 395.6µs 7.58ms 2313. 176.6KB 1 1135
#3 upper.tri 326.9µs 488.5µs 517.6µs 699.07µs 2047. 336KB 0 1024
bench2(5000)
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:tm> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 combn 48.13s 48.13s 48.13s 48.13s 0.0208 95.3MB 204 1
#2 tri_ind 861.7ms 861.7ms 861.7ms 861.7ms 1.16 429.3MB 0 1
#3 upper.tri 1.95s 1.95s 1.95s 1.95s 0.514 810.6MB 3 1
For me, it is interesting to know that combn
is not written in compiled code. It actually has an R-level for loop inside. The various alternatives is just trying to speed it up in "N choose 2" case without writing compiled code.
Better choice??
Function combinations
from gtools
package uses recursive algorithm, which is problematic for big problem size. Function combn
from combinat
package does not use compiled code so it is no better than combn
from R core. The RcppAlgos
package by Joseph Wood has a comboGenearl
function which is the fastest one I see so far.
library(RcppAlgos)
## index generation
bench3 <- function (n) {
bench::mark("tri_ind" = tri_ind(n, FALSE, FALSE),
"Joseph" = comboGeneral(n, 2), check = FALSE)
}
bench3(5000)
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:tm> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 tri_ind 290ms 297ms 297ms 303ms 3.37 143.4MB 4 2
#2 Joseph 134ms 155ms 136ms 212ms 6.46 95.4MB 2 4
## on OP's problem
bench4 <- function (n) {
vec <- numeric(n)
bench::mark("tri_ind" = {ind <- tri_ind(n, FALSE, FALSE);
vec[ind[[1]]] * vec[ind[[2]]]},
"Joseph" = comboGeneral(vec, 2, constraintFun = "prod", keepResults = TRUE),
check = FALSE)
}
bench4(5000)
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:tm> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 tri_ind 956ms 956ms 956ms 956ms 1.05 429MB 3 1
#2 Joseph 361ms 362ms 362ms 363ms 2.76 286MB 1 2
Joseph Wood has various answers on combinations / permutations. For example: Faster version of combn.
回答3:
Here is the "outer
+ upper triangular part option"
m <- outer(1:4, 1:4)
as.numeric(m[upper.tri(m)])
#[1] 2 3 6 4 8 12
Another method is to index the elements of the upper/lower triangular part of a matrix directly and then calculate the pairwise product for those elements (adapted from this post)
upperouter <- function(x) {
N <- length(x)
i <- sequence(1:N)
j <- rep(1:N, 1:N)
(1:N)[i[i != j]] * (1:N)[j[j != i]]
}
upperouter(1:4)
#[1] 2 3 6 4 8 12
Benchmark analysis
It's interesting to compare the different methods in a microbenchmark
analysis for a larger vector
(e.g. 1:100
):
upperouter <- function(x) {
N <- length(x)
i <- sequence(1:N)
j <- rep(1:N, 1:N)
(1:N)[i[i != j]] * (1:N)[j[j != i]]
}
finv <- function (n) {
k <- 1:(n * (n - 1) / 2)
j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
cbind(i, j)
}
N <- 100
library(microbenchmark)
res <- microbenchmark(
combn = combn(1:N, 2, prod),
outer = {
m <- outer(1:N, 1:N)
as.numeric(m[upper.tri(m)])
},
upperouter = {
upperouter(1:N)
},
finv = {
vec <- 1:N
ind <- finv(length(vec))
vec[ind[, 2]] * vec[ind[, 1]]
},
sapply = {
m <- sapply(1:N, "*", 1:N)
as.numeric(m[upper.tri(m)])
})
res
#Unit: microseconds
# expr min lq mean median uq max neval
# combn 6584.938 6896.0545 7584.8084 7035.9575 7886.5720 12020.626 100
# outer 106.791 113.6535 157.3774 138.9205 160.5985 950.706 100
# upperouter 201.943 210.1515 277.0989 227.6370 259.1975 2806.962 100
# finv 308.447 324.1960 442.3220 332.7250 375.3490 4128.325 100
# sapply 232.805 249.9080 298.3674 283.8580 315.9145 556.463 100
library(ggplot2)
autoplot(res)