Product between all combinations of a vector's

2019-02-24 10:37发布

问题:

Suppose I have a vector c(1, 2, 3, 4) with no duplicated values. I need a vector c(1 * 2, 1 * 3, 1 * 4, 2 * 3, 2 * 4, 3 * 4), so the multiplication is done in all possible combinations of this vector's values. Is there a way of doing that? Thanks in advance!

回答1:

We can use combn with anonymous function call

combn(vec, 2, FUN = function(x) x[1] * x[2])
#[1]  2  3  4  6  8 12

data

vec <- 1:4


回答2:

This is fun enough. I thought combn(1:4, 2, "*") would be the simplest solution but it actually does not work. We have to use combn(1:4, 2, prod) as Onyambu commented. The issue is: in an "N choose K" setting, FUN must be able to take a vector of length-K as input. "*" is not the right one.

## K = 2 case
"*"(c(1, 2))  ## this is different from: "*"(1, 2)
#Error in *c(1, 2) : invalid unary operator

prod(c(1, 2))
#[1] 2

We are going too far, but we will meet this sooner or later

Thanks Maurits Evers for elaboration on outer / lower.tri / upper.tri. Here is an adapted way to avoid generating those temporary matrices from outer and *****.tri:

tri_ind <- function (n, lower= TRUE, diag = FALSE) {
  if (diag) {
    tmp <- n:1
    j <- rep.int(1:n, tmp)
    i <- sequence(tmp) - 1L + j
    } else {
    tmp <- (n-1):1
    j <- rep.int(1:(n-1), tmp)
    i <- sequence(tmp) + j
    }
  if (lower) list(i = i, j = j)
  else list(i = j, j = i)
  }

vec <- 1:4
ind <- tri_ind(length(vec), FALSE, FALSE)
#$i
#[1] 1 1 1 2 2 3
#
#$j
#[1] 2 3 4 3 4 4

vec[ind[[1]]] * vec[ind[[2]]]
#[1]  2  3  4  6  8 12

The tri_ind function is wrapper of my this answer. It could be used as a fast and memory-efficient alternative for combn(length(vec), 2) or its outer-equivalence.

Originally I linked an finv function but it is not good for benchmarking, as it is designed for extracting a few elements from a "dist" object (a collapsed lower triangular matrix). If all elements of the triangular matrix are referenced, its index computation actually imposes unnecessary overhead. tri_ind is a better option.

library(bench)

benchmark index generation

bench1 <- function (n) {
  bench::mark("combn" = combn(n, 2),
              "tri_ind" = tri_ind(n, TRUE, FALSE),
              "upper.tri" = which(upper.tri(matrix(0, n, n)), arr.ind = TRUE),
              check = FALSE)
  }

## for small problem, `tri_ind` is already the fastest
bench1(100)
#  expression      min     mean  median      max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>      <bch:tm> <bch:tm> <bch:t> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
#1 combn        11.6ms   11.9ms  11.9ms  12.59ms      83.7    39.1KB     9    32
#2 tri_ind     189.3µs  205.9µs 194.6µs   4.82ms    4856.     60.4KB    21  1888
#3 upper.tri   618.4µs  635.8µs 624.1µs 968.36µs    1573.    411.7KB    57   584

## `tri_ind` is 10x faster than `upper.tri`, and 100x faster than `combn`
bench1(5000)
#  expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#1 combn         30.6s    30.6s    30.6s    30.6s    0.0327    95.4MB   242
#2 tri_ind    231.98ms 259.31ms 259.31ms 286.63ms    3.86     143.3MB     0
#3 upper.tri     3.02s    3.02s    3.02s    3.02s    0.332    953.6MB     4

benchmark on OP's problem

bench2 <- function (n) {
  vec <- numeric(n)
  bench::mark("combn" = combn(vec, 2, prod),
              "tri_ind" = {ind <- tri_ind(n, FALSE, FALSE);
                           vec[ind[[1]]] * vec[ind[[2]]]},
              "upper.tri" = {m <- outer(vec, vec);                                
                             c(m[upper.tri(m)])},
              check = FALSE)
  }

bench2(100)
#  expression      min     mean  median      max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>      <bch:tm> <bch:tm> <bch:t> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
#1 combn        18.6ms   19.2ms  19.1ms  20.55ms      52.2    38.7KB     4    22
#2 tri_ind     386.9µs  432.3µs 395.6µs   7.58ms    2313.    176.6KB     1  1135
#3 upper.tri   326.9µs  488.5µs 517.6µs 699.07µs    2047.      336KB     0  1024

bench2(5000)
#  expression      min     mean   median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 combn        48.13s   48.13s   48.13s  48.13s    0.0208    95.3MB   204     1
#2 tri_ind     861.7ms  861.7ms  861.7ms 861.7ms    1.16     429.3MB     0     1
#3 upper.tri     1.95s    1.95s    1.95s   1.95s    0.514    810.6MB     3     1

For me, it is interesting to know that combn is not written in compiled code. It actually has an R-level for loop inside. The various alternatives is just trying to speed it up in "N choose 2" case without writing compiled code.

Better choice??

Function combinations from gtools package uses recursive algorithm, which is problematic for big problem size. Function combn from combinat package does not use compiled code so it is no better than combn from R core. The RcppAlgos package by Joseph Wood has a comboGenearl function which is the fastest one I see so far.

library(RcppAlgos)

## index generation
bench3 <- function (n) {
  bench::mark("tri_ind" = tri_ind(n, FALSE, FALSE),
              "Joseph" = comboGeneral(n, 2), check = FALSE)
  }

bench3(5000)
#  expression      min     mean   median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 tri_ind       290ms    297ms    297ms   303ms      3.37   143.4MB     4     2
#2 Joseph        134ms    155ms    136ms   212ms      6.46    95.4MB     2     4

## on OP's problem
bench4 <- function (n) {
  vec <- numeric(n)
  bench::mark("tri_ind" = {ind <- tri_ind(n, FALSE, FALSE);
                           vec[ind[[1]]] * vec[ind[[2]]]},
              "Joseph" = comboGeneral(vec, 2, constraintFun = "prod", keepResults = TRUE),
              check = FALSE)
  }

bench4(5000)
#  expression      min     mean   median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 tri_ind       956ms    956ms    956ms   956ms      1.05     429MB     3     1
#2 Joseph        361ms    362ms    362ms   363ms      2.76     286MB     1     2

Joseph Wood has various answers on combinations / permutations. For example: Faster version of combn.



回答3:

Here is the "outer + upper triangular part option"

m <- outer(1:4, 1:4)
as.numeric(m[upper.tri(m)])
#[1]  2  3  6  4  8 12

Another method is to index the elements of the upper/lower triangular part of a matrix directly and then calculate the pairwise product for those elements (adapted from this post)

upperouter <- function(x) {
    N <- length(x)
    i <- sequence(1:N)
    j <- rep(1:N, 1:N)
    (1:N)[i[i != j]] * (1:N)[j[j != i]]
}
upperouter(1:4)
#[1]  2  3  6  4  8 12

Benchmark analysis

It's interesting to compare the different methods in a microbenchmark analysis for a larger vector (e.g. 1:100):

upperouter <- function(x) {
    N <- length(x)
    i <- sequence(1:N)
    j <- rep(1:N, 1:N)
    (1:N)[i[i != j]] * (1:N)[j[j != i]]
}

finv <- function (n) {
  k <- 1:(n * (n - 1) / 2)
  j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  cbind(i, j)
  }


N <- 100
library(microbenchmark)
res <- microbenchmark(
    combn  = combn(1:N, 2, prod),
    outer = {
        m <- outer(1:N, 1:N)
        as.numeric(m[upper.tri(m)])
    },
    upperouter = {
        upperouter(1:N)
    },
    finv = {
        vec <- 1:N
        ind <- finv(length(vec))
        vec[ind[, 2]] * vec[ind[, 1]]
    },
    sapply = {
        m <- sapply(1:N, "*", 1:N)
        as.numeric(m[upper.tri(m)])
    })
res
#Unit: microseconds
#       expr      min        lq      mean    median        uq       max neval
#      combn 6584.938 6896.0545 7584.8084 7035.9575 7886.5720 12020.626   100
#      outer  106.791  113.6535  157.3774  138.9205  160.5985   950.706   100
# upperouter  201.943  210.1515  277.0989  227.6370  259.1975  2806.962   100
#       finv  308.447  324.1960  442.3220  332.7250  375.3490  4128.325   100
#     sapply  232.805  249.9080  298.3674  283.8580  315.9145   556.463   100

library(ggplot2)
autoplot(res)