I want to replace an existing random number based data generator (in Python) with a hash based one so that it no longer needs to generate everything in sequence, as inspired by this article.
I can create a float from 0 to 1 by taking the integer version of the hash and dividing it by the maximum value of a hash.
I can create a flat integer range by taking the float and multiplying by the flat range. I could probably use modulo and live with the bias, as the hash range is large and my flat ranges are small.
How could I use the hash to create a gaussian or normal distributed floating point value?
For all of these cases, would I be better off just using my hash as a seed for a new random.Random object and using the functions in that class to generate my numbers and rely on them to get the distribution characteristics right?
At the moment, my code is structured like this:
num_people = randint(1,100)
people = [dict() for x in range(num_people)]
for person in people:
person['surname'] = choice(surname_list)
person['forename'] = choice(forename_list)
The problem is that for a given seed to be consistent, I have to generate all the people in the same order, and I have to generate the surname then the forename. If I add a middle name in between the two then the generated forenames will change, as will all the names of all the subsequent people.
I want to structure the code like this:
h1_groupseed=1
h2_peoplecount=1
h2_people=2
h4_surname=1
h4_forename=2
num_people = pghash([h1_groupseed,h2_peoplecount]).hashint(1,100)
people = [dict() for x in range(num_people)]
for h3_index, person in enumerate(people,1):
person['surname'] = surname_list[pghash([h1_groupseed,h2_people,h3_index,h4_surname]).hashint(0, num_of_surnames - 1)]
person['forename'] = forename_list[pghash([h1_groupseed,h2_people,h3_index,h4_forename]).hashint(0, num_of_forenames - 1)]
This would use the values passed to pghash to generate a hash, and use that hash to somehow create the pseudorandom result.
First, a big caveat: DO NOT ROLL YOUR OWN CRYPTO. If you're trying to do this for security purposes, DON'T.
Next, check out this question which lists several ways to do what you want, i.e. transform a random uniform variable into a normal one: Converting a Uniform Distribution to a Normal Distribution
Unless you're doing this for your own amusement or as a learning exercise, my very strong advice is don't do this.
PRNGs have the same general structure, even if the details are wildly different. They map a seed value s into an initial state S via some function f: S←f(s); they then iterate states via some transformation h: Si+1←h(Si); and finally they map the state to an output U via some function g: Ui←g(Si). (For simple PRNGs, f() or g() are often identity functions. For more sophisticated generators such as Mersenne Twister, more is involved.)
The state transition function h() is designed to distribute new states uniformly across the state space. In other words, it's already a hash function, but with the added benefit that for any widely accepted generator it has been heavily vetted by experts to have good statistical behavior.
Mersenne Twister, Python's default PRNG, has been mathematically proven to have k-tuples be jointly uniformly distributed for all k ≤ 623. I'm guessing that whatever hash function you choose can't make such claims. Additionally, the collapsing function g() should preserve uniformity in the outcomes. You've proposed that you "can use the integer version of the hash to create a flat number range, just by taking the modulus." In general this will introduce modulo bias, so you won't end up with a uniformly distributed result.
If you stick with the built-in PRNG, there's no reason not to use the built-in Gaussian generator. If you want to do it for your own amusement there are lots of resources that will tell you how to map uniforms to Gaussians. Well-known methods include the Box-Muller method, Marsaglia's polar method, and the ziggurat method.
UPDATE
Given the additional information you've provided in your question, I think the answer you want is contained in this section of Python's documentation for
random
:Sounds like you want separate instances of
Random
for eachperson
, seeded independently of each other or with synchronized but widely separated states as described in therandom.jumpahead()
documentation. This is one of the approaches that simulation modelers have used since the early 1950's so they can maintain repeatability between configurations to make direct comparisons of two or more systems in a fair fashion. Check out the discussion of "synchronization" on the second page of this article, or starting on page 8 of this book chapter, or pick up any of the dozens of simulation textbooks available in most university libraries and read the sections on "common random numbers." (I'm not pointing you towards Wikipedia because it provides almost no details on this topic.)Here's an explicit example showing creating multiple instances of
Random
:I have gone ahead and created a simple hash-based replacement for some of the functions in the random.Random class:
Next step is to go through my code and replace all the calls to random.Random methods with pghash objects.
I have made this into a module, which I hope to upload to pypi at some point: https://github.com/UKHomeOffice/python-pghash