Select n records at random from a set of N

2019-02-24 20:18发布

问题:

I need to select n records at random from a set of N (where 0 < n < N).

A possible algorithm is:

Iterate through the list and for each element, make the probability of selection = (number needed) / (number left)

So if you had 40 items, the first would have a 5/40 chance of being selected.

If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.

Assuming a good pseudo-random number generator, is this algorithm correct?


NOTE

There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).

The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).

Can I have a final word about the matter?

回答1:

The algorithm is absolutely correct.

It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.

As Knuth says:

This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.

The algorithm samples n elements from a set of size N and the t + 1st element is chosen with probability (n - m) / (N - t), when already m elements have been chosen.

It's easy to see that we never run off the end of the set before choosing n items (as the probability will be 1 when we have k elements to choose from the remaining k elements).

Also we never pick too many elements (the probability will be 0 as soon n == m).

It's a bit harder to demonstrate that the sample is completely unbiased, but it's

... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature

(so not just on Stackoverflow!)

The fact is we should not confuse conditional and unconditional probabilities:

For example consider the second element; if the first element was selected in the sample (this happen with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).

The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N.

TAOCP - Vol 2 - Section 3.4.2 exercise 3

Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample, random.sample in Python...).

Of course algorithm S is not always the best answer:

  • it's O(N) (even if we will usually not have to pass over all N records: the average number of records considered when n=2 is about 2/3 N; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex 5/6);
  • it cannot be used when the value of N isn't known in advance.

Anyway it works!


  1. C. T. Fan, M. E. Muller and I. Rezucha, J. Amer. Stat. Assoc. 57 (1962), pp 387 - 402
  2. T. G. Jones, CACM 5 (1962), pp 343

EDIT

how do you randomly select this item, with a probability of 7/22

[CUT]

In rare cases, you might even pick 4 or 6 elements when you wanted 5

This is from N3925 (small modifications to avoid the common interface / tag dispatch):

template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
  using dist_t = uniform_int_distribution<Size>;
  using param_t = typename dist_t::param_type;

  dist_t d{};

  Size unsampled_sz = distance(first, last);
  for (n = min(n, unsampled_sz); n != 0;  ++first)
  {
    param_t const p{0, --unsampled_sz};

    if (d(g, p) < n) { *out++ = *first; --n; }
  }

  return out;
}

There aren't floats.

  • If you need 5 elements you get 5 elements;
  • if uniform_int_distribution "works as advertised" there is no bias.


回答2:

Although the algorithm described is technically correct, it depends on having an algorithm to return a bool with arbitrary probability determined by the ratio of two ints. For example, how do you select this item with a probability of 7/22? For the point of talking, let's call it the bool RandomSelect(int x, int y) method, or just the RS(x,y) method, designed to return true with probability x/y. If you're not very concerned about accuracy, the oft-given answer is to use return Random.NextDouble() < (double)x/(double)y; which is inaccurate because Random.NextDouble() is imprecise and not perfectly uniform, and the division (double)x/(double)y is also imprecise. The choice of < or <= should be irrelevant (but it's not) because in theory it's impossible to randomly pick the infinite precision random number exactly equal to the specified probability. While I'm sure an algorithm can be created or found, to implement the RS(x,y) method precisely, which would then allow you to implement the described algorithm correctly, I think that to simply answer this question as "yes the algorithm is correct" would be misleading - as it has misled so many people before, into calculating and choosing elements using double, unaware of the bias they're introducing.

Don't misunderstand me - I'm not saying everyone should avoid using the described algorithm - I'm only saying that unless you find a more precise way to implement the RS(x,y) algorithm, your selections will be subtly biased in favor of some elements more frequently than other elements.

If you care about fairness (equal probability of all possible outcomes) I think it is better, and easier to understand, to use a different algorithm instead, as I've described below:

If you take it as given that the only source of random you have available is random bits, you have to define a technique of random selection that assures equal probability, given binary random data. This means, if you want to pick a random number in a range that happens to be a power of 2, you just pick random bits and return them. But if you want a random number in a range that's not a power of 2, you have to get more random bits, and discard outcomes that could not map to fair outcomes (throw away the random number and try again). I blogged about it with pictoral representations and C# example code here: https://nedharvey.com/blog/?p=284 Repeat the random selection from your collection, until you have n unique items.