I need to select n
records at random from a set of N
(where 0 < n < N
).
A possible algorithm is:
Iterate through the list and for each element, make the probability of selection =
(number needed) / (number left)
So if you had 40 items, the first would have a
5/40
chance of being selected.If it is, the next has a
4/39
chance, otherwise it has a5/39
chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.
Assuming a good pseudo-random number generator, is this algorithm correct?
NOTE
There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).
The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).
Can I have a final word about the matter?
Although the algorithm described is technically correct, it depends on having an algorithm to return a bool with arbitrary probability determined by the ratio of two ints. For example, how do you select this item with a probability of 7/22? For the point of talking, let's call it the
bool RandomSelect(int x, int y)
method, or just theRS(x,y)
method, designed to returntrue
with probabilityx/y
. If you're not very concerned about accuracy, the oft-given answer is to usereturn Random.NextDouble() < (double)x/(double)y;
which is inaccurate becauseRandom.NextDouble()
is imprecise and not perfectly uniform, and the division(double)x/(double)y
is also imprecise. The choice of<
or<=
should be irrelevant (but it's not) because in theory it's impossible to randomly pick the infinite precision random number exactly equal to the specified probability. While I'm sure an algorithm can be created or found, to implement theRS(x,y)
method precisely, which would then allow you to implement the described algorithm correctly, I think that to simply answer this question as "yes the algorithm is correct" would be misleading - as it has misled so many people before, into calculating and choosing elements usingdouble
, unaware of the bias they're introducing.Don't misunderstand me - I'm not saying everyone should avoid using the described algorithm - I'm only saying that unless you find a more precise way to implement the
RS(x,y)
algorithm, your selections will be subtly biased in favor of some elements more frequently than other elements.If you care about fairness (equal probability of all possible outcomes) I think it is better, and easier to understand, to use a different algorithm instead, as I've described below:
If you take it as given that the only source of random you have available is random bits, you have to define a technique of random selection that assures equal probability, given binary random data. This means, if you want to pick a random number in a range that happens to be a power of 2, you just pick random bits and return them. But if you want a random number in a range that's not a power of 2, you have to get more random bits, and discard outcomes that could not map to fair outcomes (throw away the random number and try again). I blogged about it with pictoral representations and C# example code here: https://nedharvey.com/blog/?p=284 Repeat the random selection from your collection, until you have
n
unique items.The algorithm is absolutely correct.
It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
As Knuth says:
The algorithm samples
n
elements from a set of sizeN
and thet + 1
st element is chosen with probability(n - m) / (N - t)
, when alreadym
elements have been chosen.It's easy to see that we never run off the end of the set before choosing
n
items (as the probability will be1
when we havek
elements to choose from the remainingk
elements).Also we never pick too many elements (the probability will be
0
as soonn == m
).It's a bit harder to demonstrate that the sample is completely unbiased, but it's
(so not just on Stackoverflow!)
The fact is we should not confuse conditional and unconditional probabilities:
TAOCP - Vol 2 - Section 3.4.2 exercise 3
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation,
std::experimental::sample
,random.sample
in Python...).Of course algorithm S is not always the best answer:
O(N)
(even if we will usually not have to pass over allN
records: the average number of records considered whenn=2
is about2/3 N
; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex 5/6);N
isn't known in advance.Anyway it works!
EDIT
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
There aren't floats.
uniform_int_distribution
"works as advertised" there is no bias.