How to create a list of random integer vectors whose sum is X


Question:

Creating a random vector whose sum is X (e.g. X=1000) is fairly straightforward:

import random
def RunFloat():
    Scalar = 1000
    VectorSize = 30
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    # Rescale so the entries sum to Scalar.
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector
RunFloat()

The code above creates a vector whose values are floats and whose sum is 1000.

I'm having difficulty writing a simple function that creates a vector whose values are integers and whose sum is X (e.g. X=1000*30):

import random
def RunInt():
    LowerBound = 600
    UpperBound = 1200
    VectorSize = 30
    RandomVector = [random.randint(LowerBound,UpperBound) for i in range(VectorSize)]
    RandomVectorSum = 1000*30
    #Sanity check that our RandomVectorSum is sensible/feasible
    if LowerBound*VectorSize <= RandomVectorSum and RandomVectorSum <= UpperBound*VectorSize:
        if sum(RandomVector) == RandomVectorSum:
            return RandomVector
        else:
            return RunInt()

Does anyone have any suggestions to improve on this idea? My code might never finish or run into recursion depth problems.

Edit (July 9, 2012)

Thanks to Oliver, mgilson, and Dougal for their inputs. My solution is shown below.

  1. Oliver was very creative with the multinomial distribution idea
  2. Put simply, (1) is much more likely to output certain solutions than others. Dougal demonstrated that the multinomial solution-space distribution is neither uniform nor normal, via a simple counterexample based on the Law of Large Numbers. Dougal also suggested using numpy's multinomial function, which saves me a lot of trouble, pain, and headaches.
  3. To overcome (2)'s output issue, I use RunFloat() to give what appears (I haven't tested this, so it's just a superficial appearance) to be a more uniform distribution. How much of a difference does this make compared to (1)? I don't really know off-hand, but it's good enough for my use.
  4. Thanks again to mgilson for the alternative method that does not use numpy.

Here is the code that I have made for this edit:

Edit #2 (July 11, 2012)

I realized that the normal distribution was not correctly implemented; I have since modified it to the following:

import random
def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector

from numpy.random import multinomial
import math
def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = the sum of the list's values
    Distribution = 'uniform' for a uniform distribution, 'normal' for a normal
        distribution ~ N(0,1) binned over +/- 3 sigma (default), or a list of size
        'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution giving
        the probabilities of each of the outcomes. These should sum to 1 (however,
        the last element is always assumed to account for the remaining probability,
        as long as sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
    elif Distribution.lower() == 'uniform': #I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
        UniformDistro = [1.0/ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        """
        Normal distribution construction....it's very flexible and hideous.
        Assume a +/- 3 sigma range. Warning: this may or may not be a suitable range for your implementation!
        If you wish to explore a different range, change the LowSigma and HighSigma values.
        """
        LowSigma  = -3  #-3 sigma
        HighSigma = 3   #+3 sigma
        StepSize  = 1/(float(ListSize) - 1)
        ZValues   = [(LowSigma * (1-i*StepSize) + (i*StepSize)*HighSigma) for i in range(int(ListSize))]
        #Bin the standard normal N(0,1) into ListSize cells: the first and last
        #bins take the tails, the middle bins take CDF differences between Z values.
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]  #right tail mirrors the left tail
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        #print("Normal Distribution sum = %f" % sum(NormalDistro))
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue
#Some Examples        
ListSize = 4
ListSumValue = 12
for i in range(100):
    print(RandIntVec(ListSize, ListSumValue, Distribution=RandFloats(ListSize)))

The code above can be found on GitHub; it is part of a class I built for school. user1149913 also posted a nice explanation of the problem.

Answer 1:

I would suggest not doing this recursively:

When you sample recursively, the value from the first index has a much greater possible range, whereas values in subsequent indices will be constrained by the first value. This will yield something resembling an exponential distribution.

Instead, what I'd recommend is sampling from the multinomial distribution. This will treat each index equally, constrain the sum, force all values to be integers, and sample uniformly from all possible configurations that follow these rules (note: configurations that can happen multiple ways will be weighted by the number of ways that they can occur).

To help merge your question with the multinomial notation, total sum is n (an integer), and so each of the k values (one for each index, also integers) must be between 0 and n. Then follow the recipe here.

(Or use numpy.random.multinomial as @Dougal helpfully suggested).
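
For reference, here is a minimal sketch of that suggestion with uniform pvals (note, per the next answer, that this is uniform per-index, not uniform over all valid vectors):

import numpy as np

n, k = 1000, 30  # target sum and vector length
vec = np.random.multinomial(n, [1.0 / k] * k)
print(vec, vec.sum())  # k non-negative integers that always sum to n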



Answer 2:

I just ran both @Oliver's multinomial approach and @mgilson's code a million times each, for a length-3 vector summing to 10, and looked at the number of times each possible outcome came up. Both are extremely nonuniform:

[Histogram of outcome counts not reproduced here; it also includes the indexing approach described below.]

Does this matter? Depends on whether you want "an arbitrary vector with this property that's usually different each time" vs each valid vector being equally likely.

In the multinomial approach, of course 3 3 4 is going to be much more likely than 0 0 10 (4200 times more likely, as it turns out). mgilson's biases are less obvious to me, but 0 0 10 and its permutations were the least likely by far (only ~750 times each out of a million); the most common were 1 4 5 and its permutations; not sure why, but they were certainly the most common, followed by 1 3 6. It'll typically start with a sum that's too high in this configuration (expectation 15), though I'm not sure why the reduction works out that way....

One way to get a uniform output over the possible vectors would be a rejection scheme. To get a vector of length K with sum N, you'd:

  1. Sample a vector of length K with integer elements uniformly and independently between 0 and N.
  2. Repeat until the sum of the vector is N.

Obviously this is going to be extremely slow for non-tiny K and N.
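
A direct transcription of that scheme (the function name is mine):

import random

def rejection_sample_vec(total, length):
    # keep drawing independent uniform ints in [0, total]
    # until they happen to sum to total; exact but very slow
    while True:
        vec = [random.randint(0, total) for _ in range(length)]
        if sum(vec) == total:
            return vec

print(rejection_sample_vec(10, 3))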

Another approach would be to assign a numbering to all the possible vectors; there are (N + K - 1) choose (K - 1) such vectors, so just choose a random integer in that range to decide which one you want. One reasonable way to number them is lexicographic ordering: (0, 0, 10), (0, 1, 9), (0, 2, 8), (0, 3, 7), ....

Note that the last (Kth) element of the vector is uniquely determined by the sum of the first K-1.

I'm sure there's a nice way to immediately jump to whatever index in this list, but I can't think of it right now.... Enumerating the possible outcomes and walking over them will work, but will probably be slower than necessary. Here's some code for that (though we actually use reverse lexicographic ordering here...).

from itertools import islice, combinations_with_replacement
from functools import reduce
from math import factorial
from operator import mul
import random

def _enum_cands(total, length):
    # get all possible ways of choosing `total` of our indices (with replacement);
    # for total=10, length=3 the first one is (0,0,0,0,0,0,0,0,0,0),
    # meaning we picked index 0 ten times, for [10, 0, 0]
    for t in combinations_with_replacement(range(length), total):
        cand = [0] * length
        for i in t:
            cand[i] += 1
        yield tuple(cand)

def int_vec_with_sum(total, length):
    # (total + length - 1) choose (length - 1) possible outcomes
    num_outcomes = reduce(mul, range(total + 1, total + length), 1) // factorial(length - 1)
    # that's integer division, even though SO thinks it's a comment :)
    idx = random.randrange(num_outcomes)
    return next(islice(_enum_cands(total, length), idx, None))
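
For example:

print(int_vec_with_sum(10, 3))  # one uniformly chosen outcome, e.g. (1, 6, 3)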

As shown in the histogram above, this is actually uniform over possible outcomes. It's also easily adaptable to upper/lower bounds on any individual element; just add the condition to _enum_cands.

This is slower than either of the other answers: for sum 10 length 3, I get

  • 14.7 us using np.random.multinomial,
  • 33.9 us using mgilson's,
  • 88.1 us with this approach

I'd expect that the difference would get worse as the number of possible outcomes increases.

If someone comes up with a nifty formula for indexing into these vectors somehow, it'd be much better....
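
For what it's worth, here is one sketch of such a direct jump: fix each coordinate in turn and skip whole blocks of vectors, counting block sizes with binomial coefficients. The function name is mine, math.comb requires Python 3.8+, and it uses plain lexicographic order rather than the reverse ordering of the enumeration code above:

from math import comb
import random

def unrank_composition(r, total, length):
    # return the r-th (0-based, lexicographic) length-`length` vector
    # of non-negative integers summing to `total`
    vec = []
    for parts_left in range(length - 1, 0, -1):
        v = 0
        while True:
            # number of vectors whose current coordinate equals v
            block = comb(total - v + parts_left - 1, parts_left - 1)
            if r < block:
                break
            r -= block
            v += 1
        vec.append(v)
        total -= v
    vec.append(total)  # the last coordinate is forced
    return vec

num_outcomes = comb(10 + 3 - 1, 3 - 1)
print(unrank_composition(random.randrange(num_outcomes), 10, 3))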



Answer 3:

Here's a pretty straightforward implementation.

import random
import math

def randvec(vecsum, N, maxval, minval):
    if N*minval > vecsum or N*maxval < vecsum:
        raise ValueError ('Cannot create desired vector')

    indices = list(range(N))
    vec = [random.randint(minval,maxval) for i in indices]
    diff = sum(vec) - vecsum # we were off by this amount.

    #Iterate through, incrementing/decrementing a random index 
    #by 1 for each value we were off.
    while diff != 0:  
        addthis = 1 if diff > 0 else -1 # +/- 1 depending on if we were above or below target.
        diff -= addthis

        ### IMPLEMENTATION 1 ###
        idx = random.choice(indices) # Pick a random index to modify, check if it's OK to modify
        while not (minval <= (vec[idx] - addthis) <= maxval):  #operator chaining.  If you don't know it, look it up.  It's pretty cool.
            idx = random.choice(indices) #Not OK to modify.  Pick another.

        vec[idx] -= addthis #Update that index.

        ### IMPLEMENTATION 2 ###
        # random.shuffle(indices)
        # for idx in indices:
        #    if minval <= (vec[idx] - addthis) <= maxval:
        #        vec[idx]-=addthis
        #        break
        #
        # in situations where (based on choices of N, minval, maxval and vecsum)
        # many of the values in vec MUST BE minval or maxval, Implementation 2
        # may be superior.

    return vec

a = randvec(1000,20,100,1)
print(sum(a))


Answer 4:

The most efficient way to sample uniformly from the set of partitions of N elements into K bins is to use a dynamic programming algorithm, which is O(KN). There are a multichoose (http://mathworld.wolfram.com/Multichoose.html) number of possibilities, so enumerating every one will be very slow. Rejection sampling and other monte-carlo methods will also likely be very slow.

Other methods people propose, like sampling from a multinomial, do not draw samples from a uniform distribution.

Let T(n,k) be the number of partitions of n elements into k bins. Then we have the recurrence

T(n,1) = 1 for all n >= 0
T(n,k) = sum_{m=0..n} T(n-m, k-1)

To sample K elements that sum to N, sample from K multinomial distributions going "backward" in the recurrence. (Edit: the T's in the multinomials below should be normalized to sum to one before drawing each sample.)

n1 = multinomial([T(N,K-1),T(N-1,K-1),...,T(0,K-1)])
n2 = multinomial([T(N-n1,K-1),T(N-n1-1,K-1),...,T(0,K-1)])
...
nK = multinomial([T(N-sum([n1,...,n{k-1}]),1),T(N-sum([n1,...,n{k-1}])-1,1),...,T(0,1)])

Note: I am allowing 0's to be sampled.

This procedure is similar to sampling a set of hidden state from a segmental semi-markov model (http://www.gatsby.ucl.ac.uk/%7Echuwei/paper/icml103.pdf).
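
A minimal sketch of this procedure, under the convention above that zeros are allowed (the function name and the use of random.choices, Python 3.6+, are my own). It tabulates T with running sums, then draws each coordinate with weight proportional to the number of ways to complete the vector, which makes every length-K vector of non-negative integers summing to N equally likely:

import random

def sample_uniform_sum_vec(N, K):
    # T[k-1][n] = T(n, k): number of ways to write n as k non-negative ints
    T = [[1] * (N + 1)]  # T(n, 1) = 1 for all n
    for _ in range(K - 1):
        prev, acc, row = T[-1], 0, []
        for n in range(N + 1):
            acc += prev[n]  # T(n, k) = sum_{m<=n} T(m, k-1)
            row.append(acc)
        T.append(row)
    vec, remaining = [], N
    for parts_left in range(K - 1, 0, -1):
        # P(coordinate = m) is proportional to T(remaining - m, parts_left)
        weights = [T[parts_left - 1][remaining - m] for m in range(remaining + 1)]
        m = random.choices(range(remaining + 1), weights=weights)[0]
        vec.append(m)
        remaining -= m
    vec.append(remaining)  # the final coordinate is forced
    return vec

print(sample_uniform_sum_vec(10, 3))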



Answer 5:

Just to give you another approach: implement a partition_function(X), randomly choose a number between 0 and len(partition_function(1000)) - 1, and there you have it. Now you just need to find an efficient way to calculate a partition function. These links might help:

http://code.activestate.com/recipes/218332-generator-for-integer-partitions/

http://oeis.org/A000041

EDIT: Here is some simple code:

import itertools
import random
all_partitions = {0:set([(0,)]),1:set([(1,)])}

def partition_merge(a,b):
    c = set()
    for t in itertools.product(a,b):
        c.add(tuple(sorted(list(t[0]+t[1]))))
    return c

def my_partition(n):
    if n in all_partitions:
        return all_partitions[n]
    a = set([(n,)])
    for i in range(1, n//2 + 1):
        a = partition_merge(my_partition(i), my_partition(n-i)).union(a)
    all_partitions[n] = a
    return a

if __name__ == '__main__':
    n = 30
    # if you have a few years to wait uncomment the next line
    # n = 1000
    a = my_partition(n)
    i = random.randint(0,len(a)-1)
    print(list(a)[i])


Answer 6:

This version will give a uniform distribution: it draws VectorSize - 1 random cut points in [0, Sum], sorts them together with 0 and Sum, and returns the successive differences:

from random import randint

def RunInt(VectorSize, Sum):
    x = [randint(0, Sum) for _ in range(1, VectorSize)]
    x.extend([0, Sum])
    x.sort()
    return [x[i+1] - x[i] for i in range(VectorSize)]
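
A quick check of the output:

v = RunInt(4, 12)
print(v, sum(v))  # four non-negative integers that always sum to 12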


Answer 7:

What about:

import numpy as np
def RunInt(VectorSize, Sum):
    a = np.array([np.random.rand(VectorSize)])
    b = np.floor(a/np.sum(a)*Sum)  # scale to the target sum, rounding down
    # hand out the rounding shortfall one unit at a time at random indices
    for i in range(int(Sum-np.sum(b))):
        b[0][np.random.randint(len(b[0]))] += 1
    return b[0]
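
A quick sanity check (note the result is a float array with integer-valued entries, since np.floor returns floats):

v = RunInt(30, 1000)
print(v.astype(int), int(v.sum()))  # 30 entries summing to 1000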