Generate a set of M elements from an array of size

2019-02-27 18:02发布

UPDATE: according to the comments let's make some clarifications.

I'm trying to understand solution for the following task: Randomly generate a set of M elements from an array of size N. Each element must have equal probability of being chosen.

I found the following solution (I've already read this question, but it does not answer my question):

int rand(Random random, int min, int max) {
  return random.nextInt(1 + max - min) + min;
}

char[] generateArray(char[] original, int subsetSize) {
  char[] subset = new char[subsetSize];
  Random random = new Random();

  for (int i = 0; i < subsetSize; i++) {
    subset[i] = original[i];
  }
  for (int i = subsetSize; i < original.length; i++) {
    int r = rand(random,0, i);
    boolean takeIthElement = r < subsetSize;
    if (takeIthElement) {
      subset[r] = original[i];
    }
  }

  return subset;
}
// rand() function returns inclusive value 
// i.e. rand(0, 5) will return from 0 to 5

This code was found in book "Cracking the coding interview" (Section Hard, Task 3). Author explains it as follows:

Suppose we have an algorithm that can pull a random set of m elements from an array of size n - 1. How can we use this algorithm to pull a random set of m elements from an array of size n? We can first pull a random set of size m from the first n - 1 elements. Then, we just need to decide if array[n] should be inserted into our subset (which would require pulling out a random element from it). An easy way to do this is to pick a random number k from 0 through n. If k < m, then insert array[n] into subset[k]. This will both "fairly" (i.e., with proportional probability) insert array[n] into the subset and "fairly" remove a random element from the subset. This is even cleaner to write iteratively. In this approach, we initialize an array subset to be the first m elements in original. Then, we iterate through the array, starting at element m, inserting array[i] into the subset at (random) position k whenever k < m.

I think author wanted to say that we need to generate not set, but array. So, I think the right task descriptions should be: Randomly generate an array of M elements from an array of size N. Each element must have equal probability of being chosen.

If it true, than the code above does not work correctly. Reasons:

  1. For example we have an array {'1', '2', 'a', 'b'} and m = 2
  2. Therefore, we should have qual probabilities to generate the following sets:

{1, 2}; {2, 1}; {1, a}; {a, 1}; {1, b}; {b, 1}; {a, 2}; {2, a}; {b, 2}; {2, b}; {a, b}; {b, a}

My concern here is that function will never generate the following sets: {2, 1}; {2, a}; {2, b}

So, it means that it is incorrect.

3条回答
我欲成王,谁敢阻挡
2楼-- · 2019-02-27 18:29

How can I prove it with math?

Your second for loop runs twice, first with i equal to 2, then with i equal to 3.

When i is 2, r becomes 0, 1 or 2, each with probability 1/3. So the char a is moved into your result at index 0 or 1 or not at all, each with probability 1/3. Now it is either [a, 2], [1, a] or [1, 2].

When i is 3, r is 0, 1, 2 or 3. b is moved to index 0 with probability 1/4, to index 1 with probability 1/4 and is not moved anywhere with probability 1/2.

In the following table I have given the result in all possible cases. The values under r, 0, 1 and 2, are the possible values in the first iteration (i = 2). To the right or r are the possible values in the second iteration.

r    0       1       2       3
0  [b, 2]  [a, b]  [a, 2]  [a, 2]
1  [b, a]  [1, b]  [1, a]  [1, a]
2  [b, 2]  [1, b]  [1, 2]  [1, 2]

So in the table you can read that if r is 0 both times, your method will return [b, 2], etc.

Each of the 12 cells in the table has equal probability, that is 1/12. Let’s inspect: [1, 2], [1, a], [1, b], [a, 2] and [b, 2] are there twice each. [a, b] and [b, a] come once each, but these are the same set, so that set comes twice too. This covers all possible subsets, so they are equally likely.

查看更多
劫难
3楼-- · 2019-02-27 18:47

I think author wanted to say that we need to generate not set, but array.

No, author truly meant set, but happens to store the resulting set in an array. By saying that the result is a set, it means that the order of values don't matter, which means that {1, 2} and {2, 1} is the same set.

Given that, it's ok that result will never be {2, 1}, as long as the probability of result with values 1 and 2 is 1/6, i.e. the unordered (set) probability.


If you wanted an ordered result, i.e. 12 distinct results as you listed them, then the easiest solution is to shuffle the original array and take the first M values. That guarantees equal probability of all results and guarantees no duplicates.

Shuffling an array is usually done using a Fisher–Yates shuffle, which is to iterate the array and randomly swap the element with a previous element.

The algorithm in the question is a variant of that. If skips the random shuffling of the first M values, since order doesn't matter. It then randomly swaps subsequent elements with random element, except that no swap occurs if random position is > M, and value being swapped out is simply discarded, as it would end up outside result set.

So, it is a modified Fisher–Yates shuffle to generate a random sub-set in a copy of the original array, but optimized to skip unnecessary shuffling, given we want a set, not an ordered list/array, and that we only want a subset, not all the values.

查看更多
Luminary・发光体
4楼-- · 2019-02-27 18:53

First of all, it is quite clear from the explanation and the code that the author meant set, just as they wrote. A set can be modelled as an array in the actual implementation, this does not mean anything. In programming challenges it is quite often that people use rather simple structures - like an array instead of java.util.Set.

So the task is basically:

Randomly select a set of M elements from an array of size N.

Assuming N >= M.

Now the hard part: why does this algorithm produce the correct results?

Just looking at the algorithm, it's hard to understand how it works and why. I think this is because the algorithm is actually constructed recursively, with recursion finall unwrapped in iteration.

Let's start with the recursion.

Assume we are able to randomly select M elements from an array of size N - 1. How do we select M elements from an array of size N?

Since there's a "new" element in the array, we can either replace one of the selected elements with it - or leave the set as it is. But we have to keep random properties.

A set of M elements from N-1 can be selected in (N-1)! / M!*(N-1 - M)! ways.
A set of M elements from N can be selected in N! / M!*(N - M)! ways.

This means that we should keep the set with the (N-M)/N probability and replace one of the elements with M/N probability. We will also have to select the element to replace with 1/M probability.

Let's see how it will look in the code. Assume subset is our randomly selected set of M elements out of N-1.

First we should decide if we replace one of the elements or not. We need a (N-M)/N probability. For this we can just generate a random number between 0 and N. If this number is less than M, we replace.

boolean replace = rand(random, 0, N) < M;
if (replace) {
   // then replace
}

Now we have to select one of the elements to replace. Since we model the array as a set, we can simply randomly select an index between 0 and M - 1 (inclusive). So we get:

boolean replace = rand(random, 0, N) < M;
if (replace) {
   subset[rand(random, 0, M - 1)] = original[N];
}

Here we can note that if our first random value (rand(random, 0, N)) is less than M, it is a random value between 0 and M-1. Thus, we don't need the second rand:

int r = rand(random, 0, N);
boolean replace = r < M;
if (replace) {
   subset[r] = original[N];
}

The rest should be pretty trivial.

Base case of the recursion is M == N. In this case we don't replace anything so the set of selected elements is simple the original array.

After that the recursion can be simply coded as a cycle. i represents N on each step - this gives your the code your have.

查看更多
登录 后发表回答