I have a list which I shuffle with the Python built in shuffle function (random.shuffle
)
However, the Python reference states:
Note that for even rather small len(x)
, the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...
TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try and generate a number that is hard to be guessed in advanced, or purposefully reproduced. Among these factors is usually the previous generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1
."
So, given all that, if your list is such that there are 2**19937
or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1
generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1
is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. On a simplistic case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081!
permutations, which is more than 2**19937
.
I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
What they mean is that permutations on n objects (noted n!) grows absurdly high very fast.
Basically n! = n x n-1 x ... x 1; for example, 5! = 5 x 4 x 3 x 2 x 1 = 120 which means there are 120 possible ways of shuffling a 5-items list.
On the same Python page documentation they give 2^19937-1 as the period, which is 4.something × 10^6001 or something. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
So basically there are so many possible permutations the shuffle will take from that there's probably no real reason to worry about those it won't.
But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.