Question:
I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
- maxValue is larger than n, and possibly much larger
- I don't care if the output list is sorted or not
- all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue], then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
Any better solutions?
Answer 1:
For small values of maxValue, such that it is reasonable to generate an array of all the integers in memory, you can use a variation of the Fisher-Yates shuffle, performing only the first n steps.
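A minimal Python sketch of that truncated shuffle (the function name is mine, not from the answer):

import random

def partial_fisher_yates(max_value, n):
    # Only the first n positions are shuffled; each a[i] is a uniform
    # draw from the values not yet placed, so the result is unbiased.
    a = list(range(max_value + 1))  # still needs O(maxValue) memory
    for i in range(n):
        j = random.randint(i, max_value)
        a[i], a[j] = a[j], a[i]
    return a[:n]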
If n is much smaller than maxValue and you don't wish to generate the entire array, then you can use this algorithm:
- Keep a sorted list l of the numbers picked so far, initially empty.
- Pick a random number x between 0 and maxValue - (number of elements in l).
- For each number in l, if it is smaller than or equal to x, add 1 to x.
- Add the adjusted value of x into the sorted list and repeat.
If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
Here is another algorithm that is simpler but has potentially unbounded execution time:
- Keep a set s of elements picked so far, initially empty.
- Pick a number at random between 0 and maxValue.
- If the number is not in s, add it to s.
- Go back to step 2 until s has n elements (see the sketch below).
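A minimal Python sketch of this rejection loop, assuming the standard random module:

import random

def pick_distinct(n, max_value):
    s = set()
    while len(s) < n:
        # Duplicates are absorbed by the set, i.e. effectively re-drawn.
        s.add(random.randint(0, max_value))
    return s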
In practice, if n is small and maxValue is large, this will be good enough for most purposes.
Answer 2:
Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).
It is based on Floyd's random sample algorithm. See my blog post about it for details.
The code is in Java:
private static Random rnd = new Random();

public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    // Floyd's algorithm: for each i in [count-n, count), pick a random
    // value in [0, i]; on a collision insert i itself, which cannot have
    // been inserted in an earlier iteration.
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);
        else
            res.add(item);
    }
    return res;
}
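For comparison, a minimal Python sketch of the same Floyd loop (my translation, not the answerer's code):

import random

def floyd_sample(max_value, n):
    res = set()
    count = max_value + 1
    for i in range(count - n, count):
        item = random.randint(0, i)
        # If item was already picked, i itself cannot have been picked
        # yet, so inserting i keeps every n-subset equally likely.
        res.add(i if item in res else item)
    return res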
Answer 3:
Here is one way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1. I add x1 to my subset with probability m/n.
- If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
- If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
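A short Python sketch of the pass described above (indices stand in for x1..xn; the names are illustrative):

import random

def select_subset(m, n):
    # Assumes m <= n.
    chosen = []
    for i in range(n):
        # Keep x_{i+1} with probability (m - already chosen) / (n - i).
        if random.randrange(n - i) < m - len(chosen):
            chosen.append(i)
    return chosen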
Answer 4:
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with low constant cost.
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to keep the M picked numbers sorted so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick the numbers to not include; this can be a win, even if you have to push almost N numbers around, if your random number generation is expensive.)
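A rough Python sketch of that dispatch (the switch-over thresholds are placeholders to tune, not from the answer):

import random

def sample(M, N):
    # Pick M distinct integers uniformly from range(N).
    if 2 * M > N:
        # Symmetric case from the note above: resample the complement.
        excluded = set(sample(N - M, N))
        return [x for x in range(N) if x not in excluded]
    if 10 * M > N:  # placeholder threshold
        # Scan strategy: take each item with probability (M-m)/(N-n).
        out = []
        for i in range(N):
            if random.randrange(N - i) < M - len(out):
                out.append(i)
        return out
    # Resampling strategy: re-draw on collision.
    picked = set()
    while len(picked) < M:
        picked.add(random.randrange(N))
    return sorted(picked)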
Answer 5:
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in Python:
import bisect
import random

def pick(n, maxValue):
    chosen = []
    for i in range(n):
        r = random.randint(0, maxValue - i)
        # Shift r past every already-chosen value that is <= it.
        for e in chosen:
            if e <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen
Answer 6:
The trick is to use a variation of the shuffle, in other words a partial shuffle.
function random_pick( a, n )
{
    N = len(a);
    n = min(n, N);
    picked = array_fill(0, n, 0);
    backup = array_fill(0, n, 0);

    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on the fisher-yates-knuth shuffle
    for (i=0; i<n; i++) // O(n) times
    {
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
    }

    // restore partially shuffled input array from backup
    // optional step; skip it if the caller does not need the array restored
    for (i=n-1; i>=0; i--) // O(n) times
    {
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
    }
    return picked;
}
NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffle), and does not need hashmaps (which may not be available and/or usually hide complexity behind their implementation; e.g. fetch time is not O(1), and might even be O(n) in the worst case).
Adapted from here.
Answer 7:
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
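One way to flesh that idea out: an LCG with full period m = maxValue + 1 visits every residue exactly once, so its first n outputs are distinct. Below is a sketch (my construction, not the answerer's code) that uses the Hull-Dobell conditions to guarantee full period; note the caveat that consecutive LCG outputs are correlated, so the resulting sets are not uniform over all n-subsets.

import random
from math import gcd

def distinct_prime_factors(m):
    # Trial division; fine for moderate m.
    out, d = [], 2
    while d * d <= m:
        if m % d == 0:
            out.append(d)
            while m % d == 0:
                m //= d
        d += 1
    if m > 1:
        out.append(m)
    return out

def lcg_pick(n, max_value):
    m = max_value + 1
    if m == 1:
        return [0]
    # Hull-Dobell: full period iff c is coprime to m, a-1 is divisible by
    # every prime factor of m, and a-1 is divisible by 4 when 4 divides m.
    k = 1
    for p in distinct_prime_factors(m):
        k *= p
    if m % 4 == 0:
        k *= 2
    a = 1 + k * random.randint(1, max(1, m // k))
    c = random.randrange(1, m)
    while gcd(c, m) != 1:
        c = random.randrange(1, m)
    x = random.randrange(m)  # random starting point
    out = []
    for _ in range(n):
        out.append(x)
        x = (a * x + c) % m
    return out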
Answer 8:
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think the algorithm below is optimal, i.e. you cannot get better performance than this.
For choosing n numbers out of m numbers, the best algorithm offered so far is presented below. Its worst-case run time complexity is O(n), and it needs only a single array to store the original numbers. It partially shuffles the first n elements of the original array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:
- Function getrand: this is just a PRNG that returns a number from 0 up to (but excluding) upto.
- Function randselect: this is the function that randomly chooses n unique numbers out of m many numbers. This is what this question is about.
- Function main: this is only to demonstrate a use for the other functions, so that you could compile it into a program and have fun.
#include <stdio.h>
#include <stdlib.h>

/* Rejection-sample a value in [0, upto): discard rand() results at or
 * above upto so every bin stays equally likely. */
int getrand(int upto) {
    long int r;
    do {
        r = rand();
    } while (r >= upto);
    return r;
}

void randselect(int *all, int end, int select) {
    /* Largest multiple of end not exceeding RAND_MAX, so the bins divide evenly. */
    int upto = RAND_MAX - (RAND_MAX % end);
    int binwidth = upto / end;
    int c;
    for (c = 0; c < select; c++) {
        /* randomly choose some bin */
        int bin = getrand(upto) / binwidth;
        /* swap c with bin */
        int tmp = all[c];
        all[c] = all[bin];
        all[bin] = tmp;
    }
}

int main() {
    int end = 1000;
    int select = 5;

    /* initialize all numbers up to end */
    int *all = malloc(end * sizeof(int));
    int c;
    for (c = 0; c < end; c++) {
        all[c] = c;
    }

    /* select `select` unique numbers randomly */
    srand(0);
    randselect(all, end, select);
    for (c = 0; c < select; c++) printf("%d ", all[c]);
    putchar('\n');

    free(all);
    return 0;
}
Here is the output of an example run where I randomly generate permutations of 4 numbers out of a pool of 8, 100,000,000 times. I then use those permutations to compute the probability of each unique permutation occurring, and sort the results by this probability. You can see that the numbers are fairly close, which I think means the output is uniformly distributed. The theoretical probability is 1/1680 = 0.000595238095238095. Note how close the empirical test comes to the theoretical value.