So, I'm watching Robert Sedgewick's videos on Coursera and I am currently at the Shuffling one. He shows a "poorly written" piece of shuffling code used for online poker (it has a few other bugs, which I've removed because they are not related to my question). This is how the algorithm works:
for (int i = 0; i < N; i++) {
    int r = new Random().nextInt(53);
    swap(cardArray, i, r);
}
It iterates all the cards once. At each iteration a random number is generated and the i-th card is swapped with the r-th card. Simple, right?
While I understand the algorithm, I didn't understand his probability calculation. He said that because Random uses a 32-bit seed (or 64, it doesn't seem to matter), this is constrained to only 2^32 different permutations.
He also said that Knuth's algorithm is better (same for loop, but choose a number between 1 and i) because it gives you N! permutations.
I can agree with Knuth's algorithm calculations. But I think that on the first one (which is supposed to be the faulty one) there should be N^N different permutations.
Is Sedgewick wrong or am I missing a fact?
Sedgewick's way of explaining it seems very strange and obtuse to me.
Imagine you had a deck of only 3 cards and applied the algorithm shown.
After the first card was swapped there would be 3 possible outcomes. After the second, 9. And after the 3rd swap, 27. Thus, we know that the swapping algorithm produces 27 equally likely outcomes, some of which are duplicates of others.
Now, we know for a fact that there are 3 * 2 * 1 = 6 possible arrangements of a 3-card deck. However, 27 is NOT divisible by 6. Therefore, we know for a fact that some arrangements will be more common than others, even without computing what they are. Therefore, the swapping algorithm will not result in an equal probability among the 6 possibilities, i.e., it will be biased towards certain arrangements.
The same exact logic extends to the case of 52 cards: 52^52 has only 2 and 13 as prime factors, while 52! contains primes such as 47, so 52^52 is not divisible by 52! either.
We can investigate which arrangements are preferred by looking at the distribution of outcomes in the three-card case, which is:

1 2 3   4 occurrences
1 3 2   5 occurrences
2 1 3   5 occurrences
2 3 1   5 occurrences
3 1 2   4 occurrences
3 2 1   4 occurrences

Total  27
Examining these, we notice that the arrangements 1 3 2, 2 1 3 and 2 3 1 each occur 5 times out of 27, while 1 2 3, 3 1 2 and 3 2 1 occur only 4 times each. So even for a 3-card deck the algorithm visibly favors some arrangements over others.
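If you want to double-check those counts, here is a small brute-force sketch (class and variable names are just illustrative) that enumerates all 27 possible sequences of picks for a 3-card deck and tallies the resulting arrangements:

import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class NaiveShuffleCount {
    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        // Try every possible sequence of random picks (3 * 3 * 3 = 27).
        for (int r0 = 0; r0 < 3; r0++) {
            for (int r1 = 0; r1 < 3; r1++) {
                for (int r2 = 0; r2 < 3; r2++) {
                    int[] deck = {1, 2, 3};
                    int[] picks = {r0, r1, r2};
                    for (int i = 0; i < 3; i++) {   // the naive shuffle
                        int tmp = deck[i];
                        deck[i] = deck[picks[i]];
                        deck[picks[i]] = tmp;
                    }
                    counts.merge(Arrays.toString(deck), 1, Integer::sum);
                }
            }
        }
        counts.forEach((k, v) -> System.out.println(k + "  " + v + " occurrences"));
    }
}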
Sedgewick is right, of course. To get a truly random order of cards, you must first use an algorithm that selects equally among the N! possible permutations, which means one that selects one of N, then one of N-1, then one of N-2, etc., and produces a different result for each combination of choices, such as the Fisher-Yates algorithm.
Secondly, it is necessary to have a PRNG with an internal state of at least log2(N!) bits, or else it cannot possibly produce all of the combinations: it can yield at most one shuffle per distinct internal state. For 52 cards, that's about 226 bits; 32 isn't even close.
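For concreteness, a Fisher-Yates shuffle in Java could look roughly like this (class and method names are illustrative; java.util.Random is used only to keep the sketch short):

import java.util.Arrays;
import java.util.Random;

class Shuffler {
    // Minimal Fisher-Yates sketch: position i is swapped with a random
    // position r in [0, i], so every one of the N! orderings corresponds to
    // exactly one sequence of picks (assuming the generator itself is unbiased).
    static void fisherYates(int[] deck, Random rnd) {
        for (int i = deck.length - 1; i > 0; i--) {
            int r = rnd.nextInt(i + 1);   // 0 <= r <= i
            int tmp = deck[i];
            deck[i] = deck[r];
            deck[r] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] deck = new int[52];
        for (int i = 0; i < deck.length; i++) deck[i] = i;  // cards 0..51
        // Note: java.util.Random has only a 48-bit state, which (as argued
        // above) is far too small to reach all 52! orderings.
        fisherYates(deck, new Random());
        System.out.println(Arrays.toString(deck));
    }
}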
I'm sorry, but I have to disagree with the answers of Aasmund and Lee Daniel. Every permutation of N elements can be expressed as the composition of at most 3(N - 1) transpositions between position 1 and some position i with 1 <= i <= N (which is easy to prove by induction on N - see below). Therefore, in order to generate a random permutation it is enough to generate 3(N - 1) random integers between 1 and N. In other words, your random generator only needs to produce 3(N - 1) integers, each between 1 and N.
Theorem
Every permutation of {1, ..., N} can be expressed as the composition of at most N-1 transpositions.
Proof (by induction on N)
CASE N = 1.
The only permutation of {1} is (1), which can be written as the composition of 0 transpositions (the empty composition is the identity).
CASE N = 2. (Only for those who weren't convinced by the case N = 1 above.)
There are two permutations of 2 elements: (1,2) and (2,1). Permutation (1,2) is the identity, i.e., the trivial transposition of 1 with itself. Permutation (2,1) is the transposition of 1 and 2.
INDUCTIVE STEP: Case N -> N + 1.
Take any permutation s of {1, ..., N, N+1}. If s doesn't move N+1, then s is actually a permutation of {1, ..., N} and, by the induction hypothesis, can be written as the composition of at most N-1 transpositions between indexes i, j with 1 <= i, j <= N.
So let's assume that s moves N+1 to K. Let t be the transposition that exchanges N+1 and K. Then ts doesn't move N+1 (N+1 -> K -> N+1), so ts is effectively a permutation of {1, ..., N} and, by the induction hypothesis, can be written as the composition of at most N-1 transpositions, i.e.,

ts = t1 ... tN-1.

Since t is its own inverse, composing with t on the left gives

s = t t1 ... tN-1,

which consists of at most N transpositions (one less than N+1).
Corollary
Every permutation of {1, ..., N} can be written as the composition of (at most) 3(N-1) transpositions between 1 and i, where 1 <= i <= N.
Proof
In view of the Theorem it is enough to show that any transposition between two indexes i and j can be written as the composition of 3 transpositions between 1 and some index. But
swap(i,j) = swap(1,i)swap(1,j)swap(1,i)
where writing the swaps next to each other denotes composition of the corresponding transpositions.
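If you prefer to see that identity rather than prove it, here is a tiny sanity-check sketch (0-indexed, so "position 1" becomes index 0; class and variable names are just for illustration):

import java.util.Arrays;

class SwapDecomposition {
    static void swap(int[] a, int i, int j) {
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }

    public static void main(String[] args) {
        int i = 2, j = 4;                  // arbitrary example indices
        int[] direct = {1, 2, 3, 4, 5, 6};
        int[] viaFirst = direct.clone();

        swap(direct, i, j);                // one transposition of i and j

        swap(viaFirst, 0, i);              // the same effect using only swaps
        swap(viaFirst, 0, j);              // that involve position 0
        swap(viaFirst, 0, i);

        System.out.println(Arrays.equals(direct, viaFirst));  // prints: true
    }
}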
Since the sequence of numbers produced by a random number generator is uniquely determined by its seed, the argument is right - but it applies to Knuth's algorithm as well, and to any other shuffling algorithm: if N! > 2^M (where N is the number of cards and M is the number of bits in the seed), some permutations will never be generated. Even if the seed is big enough, the real difference between the algorithms lies in the probability distribution: the first algorithm does not produce a uniform probability over the different permutations, while Knuth's does (assuming that the random generator is "random" enough). Note that Knuth's algorithm is also called the Fisher-Yates shuffle.
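To put rough numbers on the N! > 2^M comparison for a 52-card deck, here is a quick back-of-the-envelope sketch (class name is illustrative):

class SeedBits {
    public static void main(String[] args) {
        double bits = 0;
        for (int k = 2; k <= 52; k++) {
            bits += Math.log(k) / Math.log(2);   // accumulate log2(k), giving log2(52!)
        }
        // Prints roughly 225.6: at least 226 bits of seed/state are needed,
        // so a 32-bit (or even 64-bit) seed cannot reach all 52! orderings.
        System.out.printf("log2(52!) = %.1f bits%n", bits);
    }
}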