Creating a random vector whose sum is X (e.g. X=1000) is fairly straightforward:
import random

def RunFloat():
    Scalar = 1000
    VectorSize = 30
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector

RunFloat()
The code above creates a vector whose values are floats and whose sum is 1000.
I'm having difficulty writing a simple function that creates a vector whose values are integers and whose sum is X (e.g. X=1000*30):
import random

def RunInt():
    LowerBound = 600
    UpperBound = 1200
    VectorSize = 30
    RandomVector = [random.randint(LowerBound, UpperBound) for i in range(VectorSize)]
    RandomVectorSum = 1000*30
    # Sanity check that our RandomVectorSum is sensible/feasible
    if LowerBound*VectorSize <= RandomVectorSum and RandomVectorSum <= UpperBound*VectorSize:
        if sum(RandomVector) == RandomVectorSum:
            return RandomVector
        else:
            return RunInt()  # retry until a draw happens to hit the target sum
Does anyone have any suggestions to improve on this idea? My code might never finish or run into recursion depth problems.
Edit (July 9, 2012)
Thanks to Oliver, mgilson, and Dougal for their inputs. My solution is shown below.
1. Oliver was very creative with the multinomial distribution idea.
2. Put simply, (1) is very likely to output certain solutions more often than others. Dougal demonstrated that the multinomial solution space distribution is neither uniform nor normal with a simple test/counterexample based on the Law of Large Numbers. Dougal also suggested using numpy's multinomial function, which saves me a lot of trouble, pain, and headaches.
3. To overcome (2)'s output issue, I use RandFloats() to give what appears (I haven't tested this, so it's just a superficial appearance) to be a more uniform distribution. How much of a difference does this make compared to (1)? I don't really know off-hand, but it's good enough for my use.
4. Thanks again to mgilson for the alternative method that does not use numpy.
Here is the code that I have made for this edit:
Edit #2 (July 11, 2012)
I realized that the normal distribution was not correctly implemented; I have since modified it to the following:
import random

def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector
from numpy.random import multinomial
import math

def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = the sum of the list values
    Distribution = can be 'uniform' for a uniform distribution, 'normal' for a
        normal distribution ~ N(0,1) with +/- 3 sigma (default), or a list of
        size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary)
        distribution. The list entries are the probabilities of each outcome
        and should sum to 1 (the last element is assumed to account for the
        remaining probability, as long as sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize - 1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
    elif Distribution.lower() == 'uniform':
        # I do not recommend this!!!! I see that it is not as random
        # (at least on my computer) as I had hoped.
        UniformDistro = [1.0/ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        """
        Normal Distribution Construction....It's very flexible and hideous
        Assume a +/-3 sigma range. Warning, this may or may not be a suitable
        range for your implementation! If you wish to explore a different
        range, then change the LowSigma and HighSigma values.
        """
        LowSigma = -3   # -3 sigma
        HighSigma = 3   # +3 sigma
        StepSize = 1/(float(ListSize) - 1)
        ZValues = [(LowSigma*(1 - i*StepSize) + (i*StepSize)*HighSigma) for i in range(int(ListSize))]
        # Construction parameters for N(Mean, Variance) - Default is N(0,1)
        Mean = 0
        Var = 1
        # NormalDistro = [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        # print("Normal Distribution sum = %f" % sum(NormalDistro))
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue

# Some examples
ListSize = 4
ListSumValue = 12
for i in range(100):
    print(RandIntVec(ListSize, ListSumValue, Distribution=RandFloats(ListSize)))
The code above can be found on GitHub; it is part of a class I built for school. user1149913 also posted a nice explanation of the problem.
I would suggest not doing this recursively:
When you sample recursively, the value from the first index has a much greater possible range, whereas values in subsequent indices will be constrained by the first value. This will yield something resembling an exponential distribution.
Instead, what I'd recommend is sampling from the multinomial distribution. This will treat each index equally, constrain the sum, force all values to be integers, and sample uniformly from all possible configurations that follow these rules (note: configurations that can happen multiple ways will be weighted by the number of ways that they can occur).
To help merge your question with the multinomial notation, the total sum is n (an integer), and each of the k values (one for each index, also integers) must be between 0 and n. Then follow the recipe here.
(Or use numpy.random.multinomial as @Dougal helpfully suggested).
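For example, a minimal sketch of the numpy route, using a uniform probability vector and the question's sizes (30 integers summing to 1000*30):

import numpy as np

k, n = 30, 1000*30
# Each of the n "units" of the total lands in one of the k positions with equal
# probability; the resulting counts are non-negative integers summing to exactly n.
vec = np.random.multinomial(n, [1.0/k]*k)
print(vec)  # 30 non-negative integers that sum to exactly 30000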
I just ran both @Oliver's multinomial approach and @mgilson's code a million times each, for a length-3 vector summing to 10, and looked at the number of times each possible outcome came up. Both are extremely nonuniform.
(I'm about to show the indexing approach.)
Does this matter? Depends on whether you want "an arbitrary vector with this property that's usually different each time" vs each valid vector being equally likely.
In the multinomial approach, of course 3 3 4 is going to be much more likely than 0 0 10 (4200 times more likely, as it turns out). mgilson's biases are less obvious to me, but 0 0 10 and its permutations were the least likely by far (only ~750 times each out of a million); the most common were 1 4 5 and its permutations; not sure why, but they were certainly the most common, followed by 1 3 6. It'll typically start with a sum that's too high in this configuration (expectation 15), though I'm not sure why the reduction works out that way....

One way to get a uniform output over the possible vectors would be a rejection scheme. To get a vector of length K with sum N, you'd:

- Sample a vector of length K with integer elements uniformly and independently between 0 and N.
- Repeat until the sum of the vector is N.

Obviously this is going to be extremely slow for non-tiny K and N.
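In code, that rejection scheme might look like the sketch below (function and variable names are mine):

import random

def rejection_sample(N, K):
    # Draw K independent uniform integers in [0, N]; accept only when they
    # happen to sum to N. Uniform over all valid vectors, but the acceptance
    # rate shrinks rapidly as N and K grow.
    while True:
        vec = [random.randint(0, N) for _ in range(K)]
        if sum(vec) == N:
            return vec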
Another approach would be to assign a numbering to all the possible vectors; there are (N + K - 1) choose (K - 1) such vectors, so just choose a random integer in that range to decide which one you want. One reasonable way to number them is lexicographic ordering: (0, 0, 10), (0, 1, 9), (0, 2, 8), (0, 3, 7), .... Note that the last (Kth) element of the vector is uniquely determined by the sum of the first K-1.

I'm sure there's a nice way to immediately jump to whatever index in this list, but I can't think of it right now.... Enumerating the possible outcomes and walking over them will work, but will probably be slower than necessary. Here's some code for that (though we actually use reverse lexicographic ordering here...).
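A rough sketch of that enumerate-and-pick idea, with _enum_cands written as a recursive generator (the implementation details here are my assumptions, not a quoted implementation):

import random

def _enum_cands(total, length):
    # Yield every length-`length` tuple of non-negative integers summing to
    # `total`, counting the first coordinate down (reverse lexicographic).
    if length == 1:
        yield (total,)
        return
    for first in range(total, -1, -1):
        for rest in _enum_cands(total - first, length - 1):
            yield (first,) + rest

def sample_uniform(total, length):
    # Materialize all (total + length - 1) choose (length - 1) candidates and
    # pick one uniformly at random; fine for small problems, slow for big ones.
    return list(random.choice(list(_enum_cands(total, length))))

print(sample_uniform(10, 3))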
As shown in the histogram above, this is actually uniform over possible outcomes. It's also easily adaptable to upper/lower bounds on any individual element; just add the condition to _enum_cands.

This is slower than either of the other answers: for sum 10, length 3, it loses to both mgilson's approach and np.random.multinomial, and I'd expect that the difference would get worse as the number of possible outcomes increases.
If someone comes up with a nifty formula for indexing into these vectors somehow, it'd be much better....
This version will give a uniform distribution.
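One standard construction with that property is the "stars and bars" cut-point trick; the sketch below is my illustration of the idea, not a quote of this answer's code:

import random

def uniform_sum_vector(total, length):
    # Choose length-1 distinct cut points among total+length-1 slots, then take
    # the gaps between consecutive cuts (minus one). Every non-negative integer
    # vector with the given sum is equally likely.
    cuts = sorted(random.sample(range(1, total + length), length - 1))
    bounds = [0] + cuts + [total + length]
    return [bounds[i + 1] - bounds[i] - 1 for i in range(length)]

print(uniform_sum_vector(10, 3))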
Here's a pretty straightforward implementation.
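A sketch of one such direct approach, consistent with how it is described in the histogram discussion above (draw integers, then nudge entries until the sum matches); the function below is my illustration, with names and details assumed:

import random

def simple_sum_vector(total, length):
    # Draw independent integers, then repeatedly bump random entries up or
    # down until the vector hits the target sum. Simple, but the resulting
    # vectors are not uniformly distributed over all possibilities.
    vec = [random.randint(0, total) for _ in range(length)]
    while sum(vec) != total:
        i = random.randrange(length)
        if sum(vec) > total and vec[i] > 0:
            vec[i] -= 1
        elif sum(vec) < total:
            vec[i] += 1
    return vec

print(simple_sum_vector(10, 3))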
The most efficient way to sample uniformly from the set of partitions of N elements into K bins is to use a dynamic programming algorithm, which is O(KN). There are a multichoose (http://mathworld.wolfram.com/Multichoose.html) number of possibilities, so enumerating every one will be very slow. Rejection sampling and other Monte Carlo methods will also likely be very slow.
Other methods people propose, like sampling from a multinomial, do not draw samples from a uniform distribution.
Let T(n,k) be the number of partitions of n elements into k bins; then we can compute the recurrence

T(n,k) = sum_{m=0..n} T(n-m, k-1), with T(n,1) = 1 for all n.
To sample K elements that sum to N, sample from K multinomial distributions going "backward" in the recurrence: draw the first element m with probability proportional to T(N-m, K-1), subtract it from the remaining total, and repeat; the last element is whatever remains. Edit: the T's in the multinomials below should be normalized to sum to one before drawing each sample.
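A minimal sketch of that backward sampling (T is memoized, and the integer counts are used directly as sampling weights, which is equivalent to normalizing them; names are mine):

import random

def T(n, k, cache={}):
    # Number of ways to write n as an ordered sum of k non-negative integers:
    # T(n, 1) = 1 and T(n, k) = sum over m of T(n - m, k - 1).
    if k == 1:
        return 1
    if (n, k) not in cache:
        cache[(n, k)] = sum(T(n - m, k - 1) for m in range(n + 1))
    return cache[(n, k)]

def sample_bins(N, K):
    # Pick each element with probability proportional to the number of ways
    # the remaining bins can complete the sum, so every full vector is
    # equally likely.
    vec = []
    remaining, bins = N, K
    while bins > 1:
        weights = [T(remaining - m, bins - 1) for m in range(remaining + 1)]
        r = random.randrange(sum(weights))
        m = 0
        while r >= weights[m]:
            r -= weights[m]
            m += 1
        vec.append(m)
        remaining -= m
        bins -= 1
    vec.append(remaining)  # the last bin is forced
    return vec

print(sample_bins(12, 4))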
Note: I am allowing 0's to be sampled.
This procedure is similar to sampling a set of hidden states from a segmental semi-Markov model (http://www.gatsby.ucl.ac.uk/%7Echuwei/paper/icml103.pdf).