I have a list with 155k files. When I call random.sample(list, 100), the results are not the same as the previous sample, but they look similar. Is there a better alternative to random.sample that returns a new list of 100 random files?
import glob
import random

# get_all_folders is a helper defined elsewhere (not shown)
folders = get_all_folders('/data/gazette-txt-files')

# get all files from all folders
def get_all_files():
    files = []
    for folder in folders:
        # extend (rather than append) keeps the list flat, so no 2D-to-1D step is needed
        files.extend(glob.glob("/data/gazette-txt-files/" + folder + "/*.txt"))
    # 200 random text files
    return random.sample(files, 200)
For purposes like randomly selecting elements from a list, random.sample suffices. True randomness isn't provided, and I'm unaware whether it is even theoretically possible.
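To check whether two samples really are "similar", one option is to count how many items they share. The sketch below assumes a made-up list of 155,000 paths (the folder name and file names are placeholders, not the real data), so treat it as an illustration rather than a measurement of the actual files.

```python
import random

# Hypothetical stand-in for the 155k file paths from the question.
paths = [f"/data/gazette-txt-files/folder/{i}.txt" for i in range(155_000)]

first = random.sample(paths, 100)
second = random.sample(paths, 100)

# Items that appear in both samples. With 155k candidates and 100 picks
# each, the expected number of shared items is well below one.
overlap = set(first) & set(second)
print(len(overlap))
```

If the overlap is near zero, the samples are different files even when the names look alike, which is common when the paths share long prefixes.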
By default, random uses a pseudorandom number generator (PRNG) called the Mersenne Twister (MT). It is suitable for applications such as simulations (and minor things like picking from a list of paths), but because it is deterministic it shouldn't be used where security is a concern.
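A small sketch of what "deterministic" means here: two generators started from the same seed produce the same sample. The seed value 42 and the list of integers are arbitrary choices for illustration.

```python
import random

items = list(range(1000))

# Two independent generators started from the same (arbitrary) seed.
gen_a = random.Random(42)
gen_b = random.Random(42)

sample_a = gen_a.sample(items, 5)
sample_b = gen_b.sample(items, 5)

# Same seed, same sequence of calls: the samples are identical.
print(sample_a == sample_b)  # True
```

This is harmless for picking files, but it is the reason the generator is unsuitable for anything an attacker must not be able to predict.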
This is why Python 3.6 introduced the secrets module with PEP 506, which uses SystemRandom (backed by os.urandom) by default and is capable of producing cryptographically secure pseudorandom numbers.
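If you do want sampling backed by the operating system's randomness source, SystemRandom offers the same sample method as the random module. A minimal sketch, assuming a placeholder list of file names:

```python
import secrets

# Placeholder file names, standing in for the real list of paths.
candidates = [f"file_{i}.txt" for i in range(1000)]

# SystemRandom draws from os.urandom, so it cannot be seeded or replayed.
secure_rng = secrets.SystemRandom()
picked = secure_rng.sample(candidates, 10)
print(picked)
```

For choosing files to inspect this is probably overkill, and it is slower than the default generator, but it is a drop-in change if predictability matters.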
Of course, the bottom line is that whether you use a PRNG or a CSPRNG to generate your numbers, they're still going to be pseudorandom.
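You may need to seed the generator; see random.seed in the documentation. Just call random.seed() before you get the samples. With no argument, it seeds from an OS randomness source if one is available, otherwise from the current system time.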