Create constrained random numbers?

2019-01-26 14:45发布

站内文章 / Python

60 0

啃猪蹄的小仙女

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

CLEANED UP TEXT:

How can I create m=5 random numbers that add upp to, say n=100. But, the first random number is say, 10 < x1 < 30, the second random nr is 5 < x2 < 20, the third random nr is 10 < x3 < 25, etc. So these five random numbers add up to 100. How can I create these constrained five numbers?

[[

Related problem A1): The standard way to create five random numbers that add up to 100, is to sample four numbers between [0,100], and add the boundaries 0 and 100, and then sort these six numbers [0,x1,x2,x3,x4,100]. The five random numbers I seek, are the deltas. That is,

100 - x[4] = delta 5
x[4]- x[3] = delta 4
x[3]- x[2] = delta 3
x[2]- x[1] = delta 2
x[1] - 0   = delta 1

These five deltas will now add up to 100. For instance, they might be 0,1,2,7,90. Here is some code that solves this problem:

total_sum = 100
n = 5
v = numpy.random.multinomial(total_sum, numpy.ones(n)/n)

]]

For my problem, I can not allow wide intervals to occur, the largest spread above is 90-7 = 83 which is too wide. So, I have to specify a tighter spread, say [10,30]. This means the largest random number is 30, which disallows large spreads such as 83.

[[

Related problem A2): A partial solution to create five numbers with identical boundaries, 10 < x_i < 30, that adds up to 100 is like this: Just do like in A1) but add the lower boundary 10, to the deltas. So I get the five random numbers that I seek like this:

100 - x[4] = delta 5 + 10
x[4]- x[3] = delta 4 + 10
x[3]- x[2] = delta 3 + 10
x[2]- x[1] = delta 2 + 10
x[1] - 0   = delta 1 + 10

Basically, I do exactly like in A1) but do not start from 0, but start from 10. Thus, each number has the lower boundary 10, but they dont have an upper boundary, it can be large, too large. How to limit the upper boundary to 30? Here the problem is how to limit the upper boundary

]]

To recapitulate, the type of the problem I try to solve looks like this: I need five random numbers adding up to 100 and I need to specify the boundaries separately for each number, say [10,30] for the first random number, and then [5,10] for the second random number, and [15,35] for the third random number, etc. And they must all add up to 100.

But the real data I am using, has ~100 numbers x_i (m=50), all of them adding up to say ~400,000. And the range is typically [3000,5000] for a number x_i. These numbers are not really accurate, I am only trying to convey something about the problem size. The purpose is to do a MCMC simulation so these numbers need to be quickly generated. People have suggested very elegant solutions that really do work, but they take too long time, so I can not use them. The problem is still unsolved. Ideally I would like an O(m) solution and O(1) memory solution.

This problem should not be NP-hard, it doesnt feel like it. There should be a polynomial time solution, right?

回答1:

Suppose you need n_1 in [10,30], n_2 in [20,40], n_3 in [30,50] and n1+n2+n3=90

If you need each possible triplet (n_1, n_2, n_3) to be equally-likely, that's going to be difficult. The number of triples of the form (20, n_2, n_3) is greater than the number of triples (10, n_2, n_3), so you can't just pick n_1 uniformly.

The incredibly slow but accurate way is to generate the all 5 randoms in the correct ranges and reject the whole group if the sum is not correct.

. . . AHA!

I found a way to parametrize the choice effectively. First, though, for simplicity note that the sum of the low bounds is the minimum possible sum. If subtract the sum of the low bounds from the target number and subtract the low bound from each generated number, you get a problem where each number is in the interval [0, max_k-min_k]. That simplifies the math and array (list) handling. Let n_k be the 0-based choice with 0<=n_k<=max_k-min_k.

The order of the sums is lexicographic, with all sums beginning with n_1=0 (if any) first, then n_1==1 sums, etc. Sums are sorted by n_2 in each of those groups, then by n_3, and so on. If you know how many sums add to the target (call that T), and how many sums start with n_1=0, 1, 2, ... then you can find the starting number n1 of sum number S in in that list. Then you can reduce the problem to adding n_2+n_3+... to get T-n_1, finding sum number S - (number original sums starting with number less than n_1).

Let pulse(n) be a list of n+1 ones: (n+1)*[1] in Python terms. Let max_k,min_k be the limits for the k'th choice, and m_k = max_k-min_k be the upper limit for 0-based choices. Then there are 1+m_1 different "sums" from the choice of the first number, and pulse(m_k) gives the distribution: 1 was to make each sum from 0 to m_1. For the first two choices, there are m_1+m_+1 different sums. It turns out that the convolution of pulse(m_1) with pulse(m_2) gives the distribution.

Time to stop for some code:

    def pulse(width, value=1):
        ''' Returns a vector of (width+1) integer ones. '''
        return (width+1)*[value]

    def stepconv(vector, width):
        ''' Computes the discrete convolution of vector with a "unit"
            pulse of given width.

            Formula: result[i] = Sum[j=0 to width] 1*vector[i-j]
            Where 0 <= i <= len(vector)+width-1, and the "1*" is the value
            of the implied unit pulse function: pulse[j] = 1 for 0<=j<=width.
        '''
        result = width*[0] + vector;
        for i in range(len(vector)):
            result[i] = sum(result[i:i+width+1])
        for i in range(len(vector), len(result)):
            result[i] = sum(result[i:])
        return result

That's coded specifically for only doing convolutions with a "pulse" array, so every linear combination in the convolution is just a sum.

Those are used only in the constructor of the final class solution:

class ConstrainedRandom(object):
    def __init__(self, ranges=None, target=None, seed=None):
        self._rand = random.Random(seed)
        if ranges != None: self.setrange(ranges)
        if target != None: self.settarget(target)

    def setrange(self, ranges):
        self._ranges = ranges
        self._nranges = len(self._ranges)
        self._nmin, self._nmax = zip(*self._ranges)
        self._minsum = sum(self._nmin)
        self._maxsum = sum(self._nmax)
        self._zmax = [y-x for x,y in self._ranges]
        self._rconv = self._nranges * [None]
        self._rconv[-1] = pulse(self._zmax[-1])
        for k in range(self._nranges-1, 0, -1):
            self._rconv[k-1] = stepconv(self._rconv[k], self._zmax[k-1])

    def settarget(self, target):
        self._target = target

    def next(self, target=None):
        k = target if target != None else self._target
        k = k - self._minsum;
        N = self._rconv[0][k]
        seq = self._rand.randint(0,N-1)
        result = self._nranges*[0]
        for i in range(len(result)-1):
            cv = self._rconv[i+1]
            r_i = 0
            while k >= len(cv):
                r_i += 1
                k -= 1
            while cv[k] <= seq:
                seq -= cv[k]
                r_i += 1
                k -= 1
            result[i] = r_i
        result[-1] = k # t
        return [x+y for x,y in zip(result, self._nmin)]

    # end clss ConstrainedRandom

Use that with:

ranges = [(low, high), (low, high), ...]
cr = ConstrainedRandom(ranges, target)
seq = cr.next();
print(seq)
assert sum(seq)==target

seq = cr.next(); # get then get the next one.

...etc. The class could be trimmed down a bit, but the main space overhead is in the _rconv list, which has the stored convolutions. That's roughly N*T/2, for O(NT) storage.

The convolutions only use the ranges, with a lot of randoms generated with the same constraints, the table construction time "amortizes away" to zero. The time complexity of .next() is roughly T/2 on average and O(T), in terms of the number of indexes into the _rconv lists.

To see how the algorithm works, assume a sequence of 3 zero-based choices, with max values (5,7,3), and a 0-based target T=10. Define or import the pulse and stepconv functions in an Idle session, then:

>>> pulse(5)
[1, 1, 1, 1, 1, 1]
>>> K1 = pulse (5)
>>> K2 = stepconv(K1, 7)
>>> K3 = stepconv(K2, 3)
>>> K1
[1, 1, 1, 1, 1, 1]
>>> K2
[1, 2, 3, 4, 5, 6, 6, 6, 5, 4, 3, 2, 1]
>>> K3
[1, 3, 6, 10, 14, 18, 21, 23, 23, 21, 18, 14, 10, 6, 3, 1]
>>> K3[10]
18
>>> sum(K3)
192
>>> (5+1)*(7+1)*(3+1)
192

K3[i] shows the number of different choice n_1, n_2, n_3 such that 0 <= n_k <= m_k and Σ n_k = i. Letting * mean convolution when applied to two of these lists. Then pulse(m_2)*pulse(m_3) is gives the distribution of sums of n_2 and n_3:

>>> R23 = stepconv(pulse(7),3)
>>> R23
[1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
>>> len(R23)
11

Every value from 0 to T=10 is (barely) possible, so any choice is possible for the first number and there are R23[T-n_1] possible triplets adding to T=10 that start with N1. So, once you've found that there are 18 possible sums adding to 10, generate a random number S = randint(18) and count down through the R23[T:T-m_1-1:-1] array:

>>> R23[10:10-5-1:-1]
[1, 2, 3, 4, 4, 4]
>>> sum(R23[10:10-5-1:-1])
18

Note the sum of that list is the total computed in K3[10] above. A sanity check. Anyway, if S==9 was the random choice, then find how many leading terms of that array can be summed without exceeding S. That's the value of n_1. In this case 1+2+3 <= S but 1+2+3+4 > S, so n_1 is 3.

As described above, you can then reduce the problem to find n_2. The final number (n_3 in this example) will be uniquely determined.

回答2:

First, define your ranges:

ranges = [range(11,30), range(6,20), range(11,25)]

Then enumerate all possibilities:

def constrainedRandoms(ranges):
    answer = []
    for vector in itertools.product(*ranges):
        if sum(ranges) == 100:
            answer.append(vector)
    return answer

Equivalent one-liner:

answer = [v for v in itertools.product(*ranges) if sum(v)==100]

Then pick a random element from them:

myChoice = random.choice(answer)

EDIT:

The cartesian product (computed with itertools.product) itself doesn't eat much RAM (this is because itertools.product returns a generator, which uses O(1) space, but a lot of time as you pointed out). Only computing the list (answer) does. The bad news is that in order to use random.choice, you do need a list. If you really don't want to use a list, then you might need to come up with a decaying probability function. Here's a very simple probability function that you could use:

def constrainedRandoms(ranges):
    choices = (v for v in itertools.product(*ranges) if sum(v)==100) # note the parentheses. This is now a generator as well

    prob = 0.5
    decayFactor = 0.97 # set this parameter yourself, to better suit your needs
    try:
        answer = next(choices)
    except StopIteration:
        answer = None
    for choice in choices:
        answer = choice
        if random.uniform(0,1) >= prob:
            return answer
        prob *= decayFactor
    return answer

The decaying probability allows you increase the probability with which you will select the next vector that fits your constraints. Think about it this way: you have a bunch of constraints. Chances are, that you'll have a relatively small number of vectors that satisfy these constraints. This means that every time you decide not to use a vector, the probability that there is another such vector decreases. Therefore, over time, you need to be progressively more biased towards returning the current vector as "the randomly selected vector". Of course, it is still possible to go through the entire loop structure without ever returning a vector. This is why the code starts with a default of the first vector that fits the constraints (assuming one exists) and returns the last such vector if the decaying probability never allows for a vector to be returned.
Note that this decaying probability idea allows you to not have to iterate through all candidate vectors. With time, the probability of the code returning the current vector under consideration increases, thus reducing the probability of it continuing on and considering other vectors. This idea therefore helps mitigate your concerns about the runtime as well (though, I'm embarrassed to add, not by very much)

回答3:

Try this:

import random
a = random.randint(10, 30)
b = random.randint(5, 20)
c = random.randint(10, 25)
d = random.randint(5, 15)
e = 100 - (a+b+c+d)

EDIT:

Here's how you'd generate a list of n random numbers, given n range constraints and the desired target sum:

def generate(constraints, limit):
    ans = [random.choice(c) for c in constraints]
    return ans if sum(ans) == limit else generate(constraints, limit)

This is how you'd use it:

generate([range(10,31),range(5,21),range(10,26),range(5,16),range(10,26)], 100)
=> [25, 20, 25, 12, 18]

Be aware, though: if the constraints don't guarantee that the sum is eventually reached, you'll get an infinite loop and a stack overflow error, for example:

generate([range(1,11), range(10, 21)], 100)
=> RuntimeError: maximum recursion depth exceeded while calling a Python object

回答4:

Here is a generalized solution:

import random
def constrained_rndms(constraints, total):
    result = []
    for x, y in constraints:
        result.append(random.randint(x,y))
    result.append(total - sum(result))
    return result

s = constrained_rndms([(10,20),(5,20),(10,25),(5,15)],100) # -- [19, 5, 16, 15, 45]
sum(s) # -- 100

回答5:

One could count the number of ways to make each possible total using two spans, four spans, eight spans, and so forth (where a span is a range of integers including its endpoints). With those numbers tabulated, you can work backwards toward a sample. For example, suppose there are 16 spans, each including numbers 1 to 9. One finds there are w = 202752772954792 ways to make a total of t = 100. Choose a random number r in the range 1 to w. Search (or lookup) to find a number J such that the sum with J>j of leftways(t-j)*rightways(j)) exceeds r and the sum with J>j+1 does not, where leftways(i) is the number of ways of making i using the first 8 spans, and rightways(j) is the number of ways of making j using the last 8 spans. Recurse using i = t-j with the first 8 spans and j with the last 8, etc. Base cases occur whenever there is only one way of making a required total.

The code below can be revised to run more efficiently by sorting the spans into descending order by width (that is, list the widest spans first) and removing some swaps. Note that if spans are not in descending order by width, the result vector will not be in the same order as the spans. Also, it might be possible to replace the linear for j ... search in findTarget by a binary search but I haven't figured out how to do so without using quite a bit more space.

The code could be made cleaner and more clear by using objects to store the tree structures, instead of just tuples.

Space used probably is O(s²·(lg m)) if there are m spans whose maxima total up to s. Time used is O(s²·(lg m)) for the initial tabulation of sums of products and probably O(m+(lg m)·(s/m)) or O(m+(lg m)·s) for each sample. (I haven't properly analyzed space and time requirements.) On a 2GHz machine, the code as shown produces about 8000 samples per second if there are 16 identical spans 1...10; about 5000 samples per second if there are 32 identical spans 1...3; and about 2000 samples per second if there are 32 identical spans 1...30. Some sample outputs for various target totals and sets of spans are shown after the code.

from random import randrange
def randx(hi):   # Return a random integer from 1 to hi
    return 1+randrange(hi) if hi>0 else 0

# Compute and return c with each cell k set equal to 
# sum of products a[k-j] * b[j], taken over all relevant j
def sumprods(lt, rt):
    a, b = lt[0], rt[0]
    (la,ma,aa), (lb,mb,bb) = a, b
    if ma-la < mb-lb:           # Swap so |A| >= |B|
        la, ma, aa, lb, mb, bb = lb, mb, bb, la, ma, aa
    lc, mc = la+lb, ma+mb
    counts = [0]*(mc+1)
    for k in range(lc,mc+1):
        for j in range(max(lb,k-ma), min(mb,k-la)+1):
            counts[k] += aa[k-j] * bb[j]
    return (lc, mc, counts)

def maketree(v):
    lv = len(v)
    if lv<2:
        return (v[0], None, None)
    ltree = maketree(v[:lv/2])
    rtree = maketree(v[lv/2:])
    return (sumprods(ltree, rtree), ltree, rtree)


def findTarget(tototal, target, tree):
    global result
    lt, rt = tree[1], tree[2]
    # Put smaller-range tree second
    if lt[0][1]-lt[0][0] < rt[0][1]-rt[0][0]: lt, rt = rt, lt
    (la,ma,aa), (lb,mb,bb) = lt[0], rt[0]
    lc, mc = la+lb, ma+mb
    count = 0
    for j in range(max(lb,tototal-ma), min(mb,tototal-la)+1):
        i = tototal-j
        count += aa[i] * bb[j]
        if count >= target: break
    # Suppose that any way of getting i in left tree is ok
    if lt[1]: findTarget(i, randx(aa[i]), lt)
    else: result += [i]
    # Suppose that any way of getting j in right tree is ok
    if rt[1]: findTarget(j, randx(bb[j]), rt)
    else: result += [j]

spans, ttotal, tries = [(1,6), (5,11), (2,9), (3,7), (4,9), (4,12), (5,8),
                        (3,5), (2,9), (3,11), (3,9), (4,5), (5,9), (6,13),
                        (7,8), (4,11)],  100, 10

spans, ttotal, tries = [(10*i,10*i+9) for i in range(16)], 1300, 10000

spans, ttotal, tries = [(1,3) for i in range(32)],  64, 10000

spans, ttotal, tries = [(1,10) for i in range(16)], 100, 10

print 'spans=', spans
vals = [(p, q, [int(i>=p) for i in range(q+1)]) for (p,q) in spans]
tree = maketree(vals)
nways = tree[0][2][ttotal]
print 'nways({}) = {}'.format(ttotal, nways)
for i in range(1,tries):
    result = []
    findTarget(ttotal, randx(nways), tree)
    print '{:2}: {}  {}'.format(i, sum(result), result)

In the output samples shown below, the lines with colons contain a sample number, a sample-total, and a sample-values array. Other lines show the set of spans and the number of ways of making a desired total.

spans= [(1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
nways(100) = 202752772954792
 1: 100  [8, 9, 1, 2, 8, 1, 10, 6, 6, 3, 6, 10, 6, 10, 10, 4]
 2: 100  [2, 6, 10, 3, 1, 10, 9, 5, 10, 6, 2, 10, 9, 7, 4, 6]
 3: 100  [1, 6, 5, 1, 9, 10, 10, 7, 10, 2, 8, 9, 10, 9, 2, 1]
 4: 100  [10, 7, 6, 3, 7, 8, 6, 5, 7, 7, 5, 1, 9, 6, 9, 4]
 5: 100  [7, 1, 10, 5, 5, 4, 9, 5, 3, 9, 2, 8, 6, 8, 10, 8]
    spans= [(1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3), (1, 3)]
nways(64) = 159114492071763
 1: 64  [2, 2, 1, 3, 1, 2, 2, 2, 1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 1, 3, 1, 3, 2, 1, 2, 3, 2, 2, 1, 2, 2, 2]
 2: 64  [1, 2, 1, 1, 1, 3, 3, 3, 2, 1, 1, 2, 3, 2, 2, 3, 3, 3, 1, 2, 1, 2, 2, 3, 2, 2, 1, 3, 1, 3, 2, 2]
 3: 64  [2, 1, 3, 2, 3, 3, 1, 3, 1, 3, 2, 2, 1, 2, 1, 3, 1, 3, 1, 2, 2, 2, 2, 1, 1, 3, 2, 2, 3, 2, 3, 1]
 4: 64  [2, 3, 3, 2, 3, 2, 1, 3, 2, 2, 1, 2, 1, 1, 3, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1, 2, 3, 3, 2, 1, 1, 3]
 5: 64  [1, 1, 1, 3, 2, 2, 3, 2, 2, 1, 2, 2, 1, 2, 1, 1, 3, 3, 2, 3, 1, 2, 2, 3, 3, 3, 2, 2, 1, 3, 3, 1]
spans= [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109), (110, 119), (120, 129), (130, 139), (140, 149), (150, 159)]
nways(1323) = 5444285920
 1: 1323  [8, 19, 25, 35, 49, 59, 69, 76, 85, 99, 108, 119, 129, 139, 148, 156]
 2: 1323  [8, 16, 29, 39, 48, 59, 69, 77, 88, 95, 109, 119, 129, 138, 147, 153]
 3: 1323  [9, 16, 28, 39, 49, 58, 69, 79, 87, 96, 106, 115, 128, 138, 149, 157]
 4: 1323  [8, 17, 29, 36, 45, 58, 69, 78, 89, 99, 106, 119, 128, 135, 149, 158]
 5: 1323  [9, 16, 27, 34, 48, 57, 69, 79, 88, 99, 109, 119, 128, 139, 144, 158]
    spans= [(1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (
1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (
1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (1, 30), (
1, 30), (1, 30), (1, 30)]
nways(640) = 19144856039395888221416547336829636235610525
 1: 640  [7, 24, 27, 9, 30, 23, 30, 27, 28, 29, 2, 30, 28, 19, 7, 27, 10, 2, 21, 23, 24, 2
7, 24, 16, 29, 8, 13, 23, 2, 19, 27, 25]
 2: 640  [30, 2, 17, 28, 30, 16, 5, 1, 26, 24, 22, 19, 26, 16, 16, 30, 27, 15, 19, 30, 15,
 30, 22, 5, 30, 9, 13, 25, 19, 15, 30, 28]
 3: 640  [2, 24, 1, 23, 20, 5, 30, 22, 24, 19, 22, 9, 28, 29, 5, 24, 14, 30, 24, 16, 26, 2
1, 26, 20, 20, 19, 24, 29, 24, 8, 23, 29]
 4: 640  [7, 20, 16, 24, 22, 14, 28, 28, 26, 8, 21, 9, 22, 24, 28, 19, 5, 13, 9, 24, 25, 2
2, 29, 18, 20, 21, 17, 26, 30, 9, 26, 30]