I have a list with around 3900 elements that I need to randomly permute to produce a statistical distribution. I looked around and found this Maximal Length of List to Shuffle with Python random.shuffle that explains that the period of the PRNG in Python is 2**19937-1
, which leads to a list with a maximum length of 2080
before it becomes impossible to generate all possible permutations. I am only producing 300-1000 permutations of the list so it unlikely that I will be producing duplicate permutations, however, since this is producing a statistical distribution I would like to have all possible permutations as potential samples.
问题:
回答1:
I agree with @user2357112 that it is unlikely to be a genuine issue -- but it seems like you should be able to use the standard random
module in such a way that all permutations are at least possible.
You could do a divide-and-conquer approach. Use the initial seed to partition the list into 2 lists of around 2000 each. The number of such partitions is roughly C(4000,2000)
which is approximately 1.66 x 10^1202
. This is less that the period, which suggests that it is at least possible for all such partitions to be generated with random.sample()
. Then - reseed the random number generator and permute the first half. Then -- reseed a second time and permute the second half. Perhaps throw in little time delays before the reseedings so you don't run into issues involving the resolution of your system clock. You could also experiment in randomly partitioning the initial list into larger numbers of smaller lists.
Mathematically, it is easy to see that if you randomly partition a list into sublists so that each partition is equally likely and then you permute each sublist in such a way that all sublist permutations are equally likely, and glue together these sublist permutations to get a whole-list permutation, then all whole-list permutations are equally likely.
Here is an implementation:
import random, time
def permuted(items, pieces = 2):
sublists = [[] for i in range(pieces)]
for x in items:
sublists[random.randint(0,pieces-1)].append(x)
permutedList = []
for i in range(pieces):
time.sleep(0.01)
random.seed()
random.shuffle(sublists[i])
permutedList.extend(sublists[i])
return permutedList
I'm not sure that time.sleep(0.01)
is really needed. My concern was that if the reseeds happened within a millisecond then on some systems the same seed might be used.
As a final remark, just because the above function (with suitable choice of pieces
) can't be shown to miss certain permutations by a simple counting argument (comparing the number of permutations with the number of initial states) this doesn't in and of itself constitute proof that all permutations are in fact possible. That would require a more detailed analysis of the random number generator, the hash function that seeds it, and the shuffle algorithm.
回答2:
There are longer-period PRNGs than MT, but they are hard to find.
To get all 3090! combinations, you need 40,905 bits of entropy. That's about 5kb. You should be able to grab a chunk of bytes that size from someplace like random.org many times with no problem. To get precisely balanced, you'll have to add some and do rejection sampling. I.e., grab 12 bits at a time (0..4095), and reject numbers higher than your current loop index. That might inflate the number of bits needed, but probably not beyond 8kb.