UPDATE: according to the comments let's make some clarifications.
I'm trying to understand solution for the following task: Randomly generate a set of M elements from an array of size N. Each element must have equal probability of being chosen.
I found the following solution (I've already read this question, but it does not answer my question):
int rand(Random random, int min, int max) {
return random.nextInt(1 + max - min) + min;
}
char[] generateArray(char[] original, int subsetSize) {
char[] subset = new char[subsetSize];
Random random = new Random();
for (int i = 0; i < subsetSize; i++) {
subset[i] = original[i];
}
for (int i = subsetSize; i < original.length; i++) {
int r = rand(random,0, i);
boolean takeIthElement = r < subsetSize;
if (takeIthElement) {
subset[r] = original[i];
}
}
return subset;
}
// rand() function returns inclusive value
// i.e. rand(0, 5) will return from 0 to 5
This code was found in book "Cracking the coding interview" (Section Hard, Task 3). Author explains it as follows:
Suppose we have an algorithm that can pull a random set of
m
elements from an array of sizen - 1
. How can we use this algorithm to pull a random set ofm
elements from an array of sizen
? We can first pull a random set of size m from the firstn - 1
elements. Then, we just need to decide ifarray[n]
should be inserted into our subset (which would require pulling out a random element from it). An easy way to do this is to pick a random number k from 0 through n. Ifk < m
, then insertarray[n]
intosubset[k]
. This will both "fairly" (i.e., with proportional probability) insertarray[n]
into the subset and "fairly" remove a random element from the subset. This is even cleaner to write iteratively. In this approach, we initialize an array subset to be the firstm
elements in original. Then, we iterate through the array, starting at elementm
, insertingarray[i]
into the subset at (random) positionk
wheneverk < m
.
I think author wanted to say that we need to generate not set, but array. So, I think the right task descriptions should be: Randomly generate an array of M elements from an array of size N. Each element must have equal probability of being chosen.
If it true, than the code above does not work correctly. Reasons:
- For example we have an array
{'1', '2', 'a', 'b'}
andm = 2
- Therefore, we should have qual probabilities to generate the following sets:
{1, 2}; {2, 1}; {1, a}; {a, 1}; {1, b}; {b, 1}; {a, 2}; {2, a}; {b, 2}; {2, b}; {a, b}; {b, a}
My concern here is that function will never generate the following sets:
{2, 1}; {2, a}; {2, b}
So, it means that it is incorrect.
Your second
for
loop runs twice, first withi
equal to 2, then withi
equal to 3.When
i
is 2,r
becomes 0, 1 or 2, each with probability 1/3. So the chara
is moved into your result at index 0 or 1 or not at all, each with probability 1/3. Now it is either [a, 2], [1, a] or [1, 2].When
i
is 3,r
is 0, 1, 2 or 3.b
is moved to index 0 with probability 1/4, to index 1 with probability 1/4 and is not moved anywhere with probability 1/2.In the following table I have given the result in all possible cases. The values under
r
, 0, 1 and 2, are the possible values in the first iteration (i
= 2). To the right orr
are the possible values in the second iteration.So in the table you can read that if
r
is 0 both times, your method will return[b, 2]
, etc.Each of the 12 cells in the table has equal probability, that is 1/12. Let’s inspect: [1, 2], [1, a], [1, b], [a, 2] and [b, 2] are there twice each. [a, b] and [b, a] come once each, but these are the same set, so that set comes twice too. This covers all possible subsets, so they are equally likely.
No, author truly meant set, but happens to store the resulting set in an array. By saying that the result is a set, it means that the order of values don't matter, which means that
{1, 2}
and{2, 1}
is the same set.Given that, it's ok that result will never be
{2, 1}
, as long as the probability of result with values1
and2
is 1/6, i.e. the unordered (set) probability.If you wanted an ordered result, i.e. 12 distinct results as you listed them, then the easiest solution is to shuffle the original array and take the first
M
values. That guarantees equal probability of all results and guarantees no duplicates.Shuffling an array is usually done using a Fisher–Yates shuffle, which is to iterate the array and randomly swap the element with a previous element.
The algorithm in the question is a variant of that. If skips the random shuffling of the first M values, since order doesn't matter. It then randomly swaps subsequent elements with random element, except that no swap occurs if random position is > M, and value being swapped out is simply discarded, as it would end up outside result set.
So, it is a modified Fisher–Yates shuffle to generate a random sub-set in a copy of the original array, but optimized to skip unnecessary shuffling, given we want a set, not an ordered list/array, and that we only want a subset, not all the values.
First of all, it is quite clear from the explanation and the code that the author meant set, just as they wrote. A set can be modelled as an array in the actual implementation, this does not mean anything. In programming challenges it is quite often that people use rather simple structures - like an array instead of
java.util.Set
.So the task is basically:
Assuming
N >= M
.Now the hard part: why does this algorithm produce the correct results?
Just looking at the algorithm, it's hard to understand how it works and why. I think this is because the algorithm is actually constructed recursively, with recursion finall unwrapped in iteration.
Let's start with the recursion.
Assume we are able to randomly select
M
elements from an array of sizeN - 1
. How do we selectM
elements from an array of sizeN
?Since there's a "new" element in the array, we can either replace one of the selected elements with it - or leave the set as it is. But we have to keep random properties.
A set of
M
elements fromN-1
can be selected in(N-1)! / M!*(N-1 - M)!
ways.A set of
M
elements fromN
can be selected inN! / M!*(N - M)!
ways.This means that we should keep the set with the
(N-M)/N
probability and replace one of the elements withM/N
probability. We will also have to select the element to replace with1/M
probability.Let's see how it will look in the code. Assume
subset
is our randomly selected set ofM
elements out ofN-1
.First we should decide if we replace one of the elements or not. We need a
(N-M)/N
probability. For this we can just generate a random number between0
andN
. If this number is less thanM
, we replace.Now we have to select one of the elements to replace. Since we model the array as a set, we can simply randomly select an index between
0
andM - 1
(inclusive). So we get:Here we can note that if our first random value (
rand(random, 0, N)
) is less thanM
, it is a random value between0
andM-1
. Thus, we don't need the secondrand
:The rest should be pretty trivial.
Base case of the recursion is
M == N
. In this case we don't replace anything so the set of selected elements is simple the original array.After that the recursion can be simply coded as a cycle.
i
representsN
on each step - this gives your the code your have.