Random sampling to give an exact sum

Posted 2019-04-05 00:33

Question:

I want to sample 140 numbers between 1000 and 100000 such that the sum of these 140 numbers is around 2 million (2000000):

sample(1000:100000,140)

such that:

sum(sample(1000:100000,140)) = 2000000

Any pointers on how I can achieve this?

Answer 1:

There exists an algorithm for generating such random numbers.

It was originally written for MATLAB, but there is an R implementation of it:

Surrogate::RandVec

Quoted from the MATLAB script's comments:

%   This generates an n by m array x, each of whose m columns
% contains n random values lying in the interval [a,b], but
% subject to the condition that their sum be equal to s.  The
% scalar value s must accordingly satisfy n*a <= s <= n*b.  The
% distribution of values is uniform in the sense that it has the
% conditional probability distribution of a uniform distribution
% over the whole n-cube, given that the sum of the x's is s.
%
%   The scalar v, if requested, returns with the total
% n-1 dimensional volume (content) of the subset satisfying
% this condition.  Consequently if v, considered as a function
% of s and divided by sqrt(n), is integrated with respect to s
% from s = a to s = b, the result would necessarily be the
% n-dimensional volume of the whole cube, namely (b-a)^n.
%
%   This algorithm does no "rejecting" on the sets of x's it
% obtains.  It is designed to generate only those that satisfy all
% the above conditions and to do so with a uniform distribution.
% It accomplishes this by decomposing the space of all possible x
% sets (columns) into n-1 dimensional simplexes.  (Line segments,
% triangles, and tetrahedra, are one-, two-, and three-dimensional
% examples of simplexes, respectively.)  It makes use of three
% different sets of 'rand' variables, one to locate values
% uniformly within each type of simplex, another to randomly
% select representatives of each different type of simplex in
% proportion to their volume, and a third to perform random
% permutations to provide an even distribution of simplex choices
% among like types.  For example, with n equal to 3 and s set at,
% say, 40% of the way from a towards b, there will be 2 different
% types of simplex, in this case triangles, each with its own
% area, and 6 different versions of each from permutations, for
% a total of 12 triangles, and these all fit together to form a
% particular planar non-regular hexagon in 3 dimensions, with v
% returned set equal to the hexagon's area.
%
% Roger Stafford - Jan. 19, 2006

Example:

test <- Surrogate::RandVec(a=1000, b=100000, s=2000000, n=140, m=1, Seed=sample(1:1000, size = 1))
sum(test$RandVecOutput)
# 2000000
hist(test$RandVecOutput)
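
The values returned by RandVec are continuous rather than integer-valued, while the sample() call in the question draws integers. If whole numbers are needed, one possible adjustment (a largest-remainder rounding sketch, not part of RandVec itself) is to floor the values and hand the leftover units to the entries with the largest fractional parts:

x <- as.numeric(test$RandVecOutput)
r <- floor(x)                                              # integer part of each value
deficit <- round(2000000 - sum(r))                         # units lost to flooring (between 0 and 139)
bump <- order(x - r, decreasing = TRUE)[seq_len(deficit)]  # entries with the largest fractional parts
r[bump] <- r[bump] + 1                                     # hand the lost units back
sum(r)
# 2000000

Only entries with a fractional part get bumped, so the adjusted values stay within [1000, 100000].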



Answer 2:

Here is a hit-and-miss approach. The basic idea is that finding 140 numbers which sum to 2000000 is the same as breaking 1:2000000 into 140 pieces, which requires 139 cutpoints. Also note that the minimum of 1000 is somewhat annoying; just subtract it from all the problem data and add it back in at the end:

rand.nums <- function(a, b, n, k){
  # finds n random integers in the range a:b which sum to k
  while(TRUE){
    x <- sample(1:(k - n*a), n - 1, replace = TRUE)  # cutpoints for the shifted problem
    x <- sort(x)
    x <- c(x, k - n*a) - c(0, x)                     # piece widths; they always sum to k - n*a
    if(max(x) <= b - a) return(a + x)                # accept only if every piece fits in the range
  }
}

Then rand.nums(1000,100000,140,2000000) evaluates to 140 integers in the given range which sum to 2000000. For these parameters the function returns almost instantly. For other choices, a solution might be impossible, or so tightly constrained that finding one by chance is effectively impossible, so some caution is needed when using the function. It could be modified by adding a maxtrials parameter and returning NA if maxtrials is exceeded without finding a solution, as sketched below.
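
A minimal sketch of that modification (the function name rand.nums.safe and the maxtrials default are illustrative choices, not from the original answer):

rand.nums.safe <- function(a, b, n, k, maxtrials = 1000){
  # same rejection scheme as rand.nums, but gives up after maxtrials attempts
  for(trial in seq_len(maxtrials)){
    x <- sort(sample(1:(k - n*a), n - 1, replace = TRUE))  # cutpoints
    x <- c(x, k - n*a) - c(0, x)                           # piece widths
    if(max(x) <= b - a) return(a + x)
  }
  NA  # no solution found within maxtrials attempts
}

rand.nums.safe(1000, 100000, 140, 2000000)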



Answer 3:

Here is an attempt that adjusts the upper bound: the idea is to shrink the upper bound of the sampling range as the running sum gets closer to the target.

sup <- 100000                 # current upper bound for the next draw
tir <- vector(length = 140)   # the 140 draws
for(i in 1:140){
  tir[i] <- sample(1000:sup, 1)
  # cap future draws at roughly twice the remaining per-draw budget,
  # so a uniform draw averages out to that budget
  sup <- max(1001, min(sup, abs(2000000 - sum(tir, na.rm = TRUE))/(140 - i)*2))
}
sum(tir)
[1] 2001751



Answer 4:

Here are some hacky ways to get near 2 million. Hopefully, someone will post a more clever solution.

In this option, we use the prob argument to make smaller values more likely and we choose the exponent by trial and error. This method is heavily skewed toward choosing lower values within the range specified in the OP.

x1 = sample(1000:100000,140, prob=(1e5:1e3)^5.5)
mean(replicate(100, sum(sample(1000:100000,140, prob=(1e5:1e3)^5.5))))
[1] 2015620

In this option, we sample from a truncated normal (truncated at your given boundaries). We initially set the mean at 2e6/140 ≈ 14285.71. However, if the standard deviation is large enough to produce lots of values near the lower boundary, the truncation biases the mean upward, so we apply a correction factor chosen by trial and error.

library(truncnorm)
x2 = rtruncnorm(140, 1e3, 1e5, mean=0.82*2e6/140, sd=1e4)
mean(replicate(1000, sum(rtruncnorm(140, 1e3, 1e5, mean=0.82*2e6/140, sd=1e4))))
[1] 2008050

If you set a smaller standard deviation, no correction is necessary. However, you get fewer values that are far from the mean this way.

mean(replicate(1000, sum(rtruncnorm(140, 1e3, 1e5, mean=2e6/140, sd=0.5e4))))
[1] 2008494

In either case, the exponent for the sample approach or the correction factor for the truncated normal could be chosen by an automated search, with a tolerance on how far the mean sum may differ from 2 million, as sketched below.
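
A minimal sketch of such a search for the truncated-normal correction factor (the helper mean_sum, the replicate count, and the search interval are illustrative assumptions, not part of the original answer):

library(truncnorm)

target <- 2e6
mean_sum <- function(corr, n_rep = 200){
  # average total when the truncated normal is centred at corr * target/140
  set.seed(1)  # common random numbers, so the search sees a smooth function of corr
  mean(replicate(n_rep, sum(rtruncnorm(140, 1e3, 1e5, mean = corr*target/140, sd = 1e4))))
}
# find the correction factor whose mean sum hits the target
corr <- uniroot(function(corr) mean_sum(corr) - target, interval = c(0.5, 1))$root
corr
# should land close to the hand-tuned 0.82 used above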

Typical distributions of the output were shown as histograms (figures not reproduced here).