I have a variable z that contains around 4000 values (from 0.0 to 1.0), whose histogram looks like this.
Now I need to generate a random variable, call it random_z
which should replicate the above distribution.
What I have tried so far is to generate a normal distribution centered at 1.0, so that I can remove all the values above 1.0 and be left with a similar distribution. I have been using numpy.random.normal, but the problem is that I cannot restrict the range to 0.0 to 1.0, because a normal distribution usually has mean = 0.0 and standard deviation = 1.0.
Is there another way to go about generating this distribution in Python?
If you want to bootstrap, you could use random.choice() on your observed series. Here I'll assume you'd like to smooth a bit more than that and that you aren't concerned with generating new extreme values.
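A minimal sketch of the bootstrap approach (the five-value list z is a hypothetical stand-in for your ~4000 observed values):

```python
import random

# Hypothetical observed sample; in practice z would hold your ~4000 values.
z = [0.1, 0.2, 0.25, 0.5, 0.9]

# Bootstrap: resample with replacement from the observed values.
random_z = [random.choice(z) for _ in range(1000)]
```

Note that every bootstrapped value is one of the original observations, so no smoothing or new extreme values occur.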
Use pandas.Series.quantile() and a uniform [0,1] random number generator, as follows.

Training: store the observed sample in a pandas Series S.

Production:
1. Generate u between 0.0 and 1.0 the usual way, e.g., random.random()
2. Return S.quantile(u)
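The two production steps can be sketched as follows (the five-value Series is a hypothetical stand-in for your observed data):

```python
import random
import pandas as pd

# Training: store the observed sample in a Series.
# Hypothetical data; use your own ~4000 observed z values here.
S = pd.Series([0.1, 0.2, 0.25, 0.5, 0.9])

# Production: one new realization per uniform draw.
u = random.random()      # uniform in [0, 1]
new_r = S.quantile(u)    # has the range and distribution of S
```

S.quantile interpolates linearly by default, so new_r always lands between the sample minimum and maximum.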
If you'd rather use numpy than pandas, from a quick reading it looks like you can substitute numpy.percentile() in step 2.

Principle of Operation:
From the sample S, pandas.Series.quantile() or numpy.percentile() is used to calculate the inverse cumulative distribution function for the method of inverse transform sampling. The quantile or percentile function (relative to S) transforms a uniform [0,1] pseudo-random number into a pseudo-random number having the range and distribution of the sample S.

Simple Sample Code
If you need to minimize coding and don't want to write functions that return only a single realization, then it seems numpy.percentile bests pandas.Series.quantile. Let S be a pre-existing sample, u the new uniform random numbers, and newR the new randoms drawn from an S-like distribution.
I need a sample of the kind of random numbers to be duplicated to put in S. For the purposes of creating an example, I am going to raise some uniform [0,1] random numbers to the third power and call that the sample S. By choosing to generate the example sample in this way, I know in advance, from the mean being equal to the definite integral of x^3 dx evaluated from 0 to 1, that the mean of S should be 1/(3+1) = 1/4 = 0.25. In your application, you would need to do something else instead, perhaps read a file, to create a numpy array S containing the data sample whose distribution is to be duplicated. Here I will check that the mean of S is 0.25 as stated above.
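The sample-construction step might look like this (10,000 draws and the default_rng generator are my choices for the sketch, not prescribed by the answer):

```python
import numpy as np

rng = np.random.default_rng()

# Example sample: cubes of 10,000 uniform [0, 1] draws.
S = rng.random(10_000) ** 3

# Sanity check: the theoretical mean of U**3 is 1/4.
print(S.mean())   # should be close to 0.25
```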
To show how np.percentile works, first get the min and max of S: the numpy.percentile function maps the range 0-100 onto the range of S. This isn't so great if we want to generate 100 new values starting from uniforms, because the scale of u is 0-1 while 0-100 is needed; multiplying the uniforms by 100 first works fine, though the result might need its type adjusted if you want a numpy array back. Once we have a numpy array, we can check the mean of the new random values.
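A sketch of that workflow (the sample S is the illustrative cubed-uniform example from above; substitute your own data):

```python
import numpy as np

rng = np.random.default_rng()
S = rng.random(10_000) ** 3           # example sample, as above

# np.percentile maps 0-100 (not 0-1) onto the range of S.
print(np.percentile(S, 0), np.percentile(S, 100))   # ~S.min(), ~S.max()

u = rng.random(100)                    # uniforms on [0, 1]

# Multiply by 100 to get the scale np.percentile expects;
# np.asarray guards the return type in case a plain list comes back.
new_r = np.asarray(np.percentile(S, 100 * u))

print(new_r.mean())                    # should be in the vicinity of 0.25
```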
You could use rejection sampling: generate pairs (z, y) with 0 <= y <= max(f(z)) until you get a pair with y <= f(z); the generated random number is z. The advantage of the method is that it can be used for any distribution, but it may take many iterations until you get a valid pair (z, y).
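A sketch of rejection sampling, assuming a hypothetical target density f(z) = 2z on [0, 1] (whose maximum is 2); in practice f would come from your data, e.g. a normalized histogram:

```python
import random

def f(z):
    """Hypothetical target density on [0, 1]; replace with your own."""
    return 2.0 * z

F_MAX = 2.0  # maximum of f on [0, 1]

def rejection_sample():
    # Draw (z, y) uniformly over the bounding box until y <= f(z).
    while True:
        z = random.random()
        y = random.uniform(0.0, F_MAX)
        if y <= f(z):
            return z

samples = [rejection_sample() for _ in range(1000)]
```

On average, each accepted draw here costs two iterations, since the box has area F_MAX = 2 while the density integrates to 1.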
If you can approximate the cumulative distribution function for the distribution (for example by taking the cumsum of its histogram), then sampling from that distribution becomes trivial.
I guess this is essentially what the answer involving Pandas is doing.
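A sketch of the histogram-cumsum approach (the cubed-uniform data and the 50-bin choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()
z = rng.random(4000) ** 3          # hypothetical stand-in for your data

# Approximate the CDF from a histogram of the observed data.
counts, edges = np.histogram(z, bins=50)
cdf = np.cumsum(counts) / counts.sum()

# Inverse transform sampling: invert the empirical CDF by
# interpolating uniforms against (CDF value -> bin edge).
u = rng.random(1000)
random_z = np.interp(u, np.concatenate(([0.0], cdf)), edges)

print(random_z.mean())   # should be close to z.mean()
```

This is essentially the same inverse-CDF idea as the quantile/percentile answer, just built from binned counts instead of order statistics.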
When using numpy.random.normal you can pass keyword arguments to set the mean and standard deviation of the returned array. These keyword arguments are loc (mean) and scale (standard deviation).
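For example, combined with the question's idea of discarding out-of-range draws (the scale of 0.2 and sample size are arbitrary choices for the sketch):

```python
import numpy as np

# loc is the mean, scale the standard deviation of the draws.
samples = np.random.normal(loc=1.0, scale=0.2, size=10_000)

# Per the question's idea: keep only draws inside [0.0, 1.0].
kept = samples[(samples >= 0.0) & (samples <= 1.0)]
```

Note that truncating this way keeps roughly half the draws, so oversample accordingly.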