This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation.
I'm simulating dispersal of organisms amongst multiple populations. Individuals from population j disperse to population i with probability p[i, j]. Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6) to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom:
set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))
# [,1]
# [1,] 0
# [2,] 3
# [3,] 7
We can extend this to consider n source populations:
set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))  # each column is a probability vector summing to 1
X <- sample(100, n)                                 # initial abundance in each population
Above, p is a matrix of probabilities of moving from one population (column) to another (row), and X is a vector of initial population sizes. The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with:
sapply(seq_len(ncol(p)), function(i) {
rmultinom(1, X[i], p[, i])
})
# [,1] [,2] [,3]
# [1,] 19 42 11
# [2,] 8 18 43
# [3,] 68 6 8
where the value of the element at the ith row and jth column is the number of individuals moving from population j to population i. The rowSums of this matrix give the new population sizes.
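For instance, storing the draw above as Y (the same sapply call as above, assigned rather than printed), the new abundances follow directly from the matrix shown:

rowSums(Y)
# [1] 72 69 82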
I'd like to repeat this many times, with a constant probability matrix but with varying (pre-defined) initial abundances. The following small example achieves this, but is inefficient for larger problems. The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations, each with different initial abundances.
X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(p)), function(i) {
rmultinom(1, x[i], p[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 79 67 45 28 74
# [2,] 92 99 40 19 52
# [3,] 51 45 16 21 35
Is there a way to better vectorise this problem?
This is an RcppGSL implementation of multi-multinomial. However, it requires you to install GSL independently, which may not be very practical.
I also compared it with a pure R implementation and with jbaums's version; the gsl version is about 6x faster than my pure R version.
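The pure R implementation isn't reproduced above. As a rough sketch only (the function name mmm_R and its interface are mine, and this is not necessarily how Randy's version works), a pure R multi-multinomial can be vectorised using the sequential conditional binomial decomposition of the multinomial, so that each destination population needs only one rbinom call across all sources and simulations:

# Illustrative sketch of a vectorised pure R multi-multinomial.
# X: n x nsim matrix of initial abundances (one column per simulation).
# p: n x n matrix whose columns are dispersal probability vectors.
# Uses n1 ~ Bin(N, p1), n2 | n1 ~ Bin(N - n1, p2 / (1 - p1)), etc.
mmm_R <- function(X, p) {
  n <- nrow(p)
  nsim <- ncol(X)
  remaining <- as.vector(X)        # trials not yet assigned, per (source, sim)
  prem <- rep(1, n * nsim)         # probability mass not yet assigned
  moved <- matrix(0L, n, n * nsim) # moved[i, ]: arrivals at population i
  for (i in seq_len(n)) {
    p_i <- rep(p[i, ], times = nsim)  # P(move to i) for each source column
    moved[i, ] <- rbinom(n * nsim, remaining, pmin(p_i / prem, 1))
    remaining <- remaining - moved[i, ]
    prem <- prem - p_i
  }
  # collapse source populations within each simulation: new abundances
  apply(array(moved, c(n, n, nsim)), 3, rowSums)
}

Called as mmm_R(X, p) with the n x 5 abundance matrix X from the question, this returns a 3 x 5 matrix of post-dispersal abundances of the same form as the question's output.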
I've discovered that the BH package brings the boost libraries to the table. This enables the following, which produces the same output as @RandyLai's gsl_mmm and as the code in my question above. (I believe enabling C++11 support should make random available without BH.) rowSumsC is provided by @hadley, here.

However, on my machine this is considerably slower than Randy's gsl_mmm, and indeed slower than my R version when there are many trials. I suspect this is due to inefficient coding, but boost's discrete_distribution also performs each multinomial trial individually, whereas this process appears to be vectorised when using gsl. I'm new to C++, so I'm not sure whether this can be made more efficient. For example:
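(The timing code itself isn't reproduced here; the following is a sketch of the kind of comparison described, assuming mmm_R from the pure R sketch above and Randy's gsl_mmm are both loaded. The call signatures, the number of simulations, and the benchmark settings are illustrative, not taken from the original post.)

library(microbenchmark)
set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n - 1)), 1)))
X <- matrix(sample(100, n * 1000, replace = TRUE), nrow = n)  # 1000 simulations
microbenchmark(
  pureR = mmm_R(X, p),    # pure R version (sketched above)
  gsl   = gsl_mmm(X, p),  # RandyLai's RcppGSL version (signature assumed)
  times = 10
)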