I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
- maxValue is larger than n, and possibly much larger
- I don't care if the output list is sorted or not
- all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue] and then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
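(For reference, a minimal Python sketch of that naive idea, using the n and maxValue names from above; the cost is O(maxValue) in both time and memory.)

```python
import random

def naive_sample(n, maxValue):
    # Materialize the whole range, shuffle it, and keep the first n values.
    pool = list(range(maxValue + 1))
    random.shuffle(pool)
    return pool[:n]
```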
Any better solutions?
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think the algorithm below is optimal, i.e. you cannot get better performance than this.

For choosing n numbers out of m numbers, the best algorithm offered so far is presented below. Its worst-case run time complexity is O(n), and it needs only a single array to store the original numbers. It partially shuffles the first n elements of the original array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:

- `getrand`: This is just a PRNG that returns a number from 0 up to `upto`.
- `randselect`: This is the function that randomly chooses n unique numbers out of m numbers. This is what this question is about.
- `main`: This is only there to demonstrate a use for the other functions, so that you can compile it into a program and have fun.

Here is the output of an example run where I randomly draw 4-number permutations out of a pool of 8 numbers, 100,000,000 times. I then use those permutations to compute the probability of each unique permutation occurring, and sort the permutations by this probability. You will notice that the numbers are fairly close, which I think means that the selection is uniformly distributed. The theoretical probability should be 1/1680 = 0.000595238095238095. Note how close the empirical test is to the theoretical value.
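(The C listing itself is not reproduced above. As a rough illustration only, here is a Python sketch of the described `randselect` partial-shuffle idea, assuming the pool is simply the integers 0..m-1; the function name and the pool-building step are mine, not the original program's.)

```python
import random

def randselect(n, m):
    # Partial Fisher-Yates shuffle: only the first n positions are shuffled,
    # each swapped with an element drawn uniformly from the rest of the array.
    pool = list(range(m))       # demo pool; if the array of m candidates
                                # already exists, the shuffle itself only
                                # touches O(n) positions
    for i in range(n):
        j = random.randrange(i, m)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]             # the first n shuffled entries are the result
```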
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).

If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with low constant cost.

If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to keep the M numbers you have picked sorted so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take about M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).

These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
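(As an illustrative sketch of the resampling strategy described above, not the answerer's own code; a built-in set stands in for the sorted tree.)

```python
import random

def resample_select(M, N):
    # Rejection sampling: draw uniform values in 1..N and discard collisions.
    # Efficient when M is much smaller than N.
    chosen = set()
    while len(chosen) < M:
        chosen.add(random.randint(1, N))
    return list(chosen)
```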
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)

The trick is to use a variation of shuffle, or in other words a partial shuffle.
NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling), and does not need hashmaps (which may not be available and/or usually hide some complexity behind their implementation, e.g. fetch time is not O(1); it might even be O(n) in the worst case).

Adapted from here.
One way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1: I add it to my subset with probability m/n. If x1 is chosen I decrement m; either way I decrement n and move on to x2, repeating the same test with the updated m and n (so the next acceptance probability is (m-1)/(n-1) or m/(n-1), as appropriate).
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
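(A minimal Python sketch of this per-element acceptance scheme, under the assumption that the set is simply the integers 0..n-1; it uses the same (picks remaining)/(candidates remaining) rule as the earlier answer.)

```python
import random

def select_sample(m, n):
    # Walk the n candidates once, accepting each with probability
    # (picks still needed) / (candidates still to be considered).
    chosen = []
    need = m
    for i in range(n):
        left = n - i                       # candidates not yet considered
        if random.random() * left < need:  # true with probability need/left
            chosen.append(i)
            need -= 1
    return chosen
```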
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in python:
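(The referenced Python listing is not included above. As a stand-in only, here is a guess at one common O(n^2) approach consistent with the description; picking each new value uniformly among the numbers not yet chosen is my assumption, not necessarily the original code.)

```python
import bisect
import random

def pick_distinct(n, maxValue):
    # Each draw picks uniformly among the values not chosen yet: draw an index
    # into the shrunken range, then shift it past every already-taken value.
    # The scan plus the sorted insertion make each draw O(n), so O(n^2) total.
    chosen = []                            # kept in ascending order
    for i in range(n):
        r = random.randint(0, maxValue - i)
        for c in chosen:
            if c <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen
```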
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
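(A rough reconstruction of that idea, not the original answer's code: choose LCG parameters satisfying the Hull-Dobell conditions so the generator has full period modulo maxValue+1, then take the first n states. The values are guaranteed distinct, but consecutive outputs are strongly patterned, so the resulting sets are not uniform over all n-element subsets.)

```python
import random

def lcg_distinct(n, maxValue, seed=None):
    # Full-period LCG modulo m = maxValue + 1 (Hull-Dobell conditions):
    #   c coprime to m, a-1 divisible by every prime factor of m,
    #   and a-1 divisible by 4 if m is divisible by 4.
    m = maxValue + 1
    factors, d, t = set(), 2, m
    while d * d <= t:                      # distinct prime factors of m
        while t % d == 0:
            factors.add(d)
            t //= d
        d += 1
    if t > 1:
        factors.add(t)
    a_minus_1 = 1
    for p in factors:
        a_minus_1 *= p
    if m % 4 == 0:
        a_minus_1 *= 2                     # make a-1 divisible by 4 as well
    a, c = a_minus_1 + 1, 1
    x = (seed if seed is not None else random.randrange(m)) % m
    out = []
    for _ in range(n):                     # n <= m states, all distinct
        out.append(x)
        x = (a * x + c) % m
    return out
```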