Question:
I need to generate a set of random numbers within an interval such that the set also has a given mean value. For instance, min = 1000, max = 10000 and a mean of 7000. I know how to create numbers within a range, but I am struggling with the mean value part. Is there a function that I can use?
Answer 1:
What you're looking for is most easily done with the so-called acceptance-rejection method.
Split your interval into smaller intervals. Specify a probability density function (PDF); it can be a very simple one, like a step function. For a Gaussian-like distribution the left and right steps would be lower than the middle step (the image below shows a more general distribution).
Generate a random number x in the whole interval and a second random number y between 0 and the maximum of your PDF. If y is greater than the value of your PDF at x, reject the generated number.
Repeat these steps until you get the desired number of points.
EDIT 1
Proof of concept on a Gaussian PDF.
Ok, so the basic idea is shown in graph (a).
1. Define/pick your probability density function (PDF). A PDF is a function of, statistically speaking, a random variable and describes the probability of finding the value x in a measurement/experiment. A function f can be the PDF of a random variable x if it satisfies: 1) f(x) >= 0 and 2) it is normalized (meaning it sums, or integrates, to 1).
2. Get the maximum (max) and the "zero points" (z1 < z2) of the PDF. Some PDFs have their zero points at infinity. In that case, determine cutoff points (z1, z2) such that PDF(x) < eta for x < z1 and for x > z2, where you pick eta yourself. Basically: set some small-ish value eta and say your zero points are the values beyond which PDF(x) is smaller than eta.
3. Define the box Ch(z1, z2, max) of your random generator. This is the region in which you generate your random variables.
4. Generate a random variable x such that z1 < x < z2.
5. Generate a second, unrelated random variable y in the range (0, max). If the value of y is larger than PDF(x), reject both randomly generated values (x, y) and go back to step 4. If the generated value y is smaller than PDF(x), accept the value x as the randomly generated point on the distribution and return it.
Here's the code that reproduces similar behavior for a Gaussian PDF.
#include "Random.h"
#include <cmath>
#include <fstream>
#include <utility>
using namespace std;

double gaus(double a, double b, double c, double x)
{
    return a*exp( -((x-b)*(x-b)/(2*c*c) ));
}

//returns (x, y) by value; a pointer to a local array would dangle after return
pair<double, double> random_on_a_gaus_distribution(double inter_a, double inter_b)
{
    double a = 1.0; //currently parameters for the Gaussian
    double b = 2.0; //are defined here to avoid having
    double c = 3.0; //a long function declaration line.
    double x = kiss::Ran(inter_a, inter_b);
    double y = kiss::Ran(0.0, 1.0);
    while (y>gaus(a,b,c,x)) //keep creating values until step 5. is satisfied.
    {
        x = kiss::Ran(inter_a, inter_b); //this is interval (z1, z2)
        y = kiss::Ran(0.0, 1.0); //this is the interval (0, max)
    }
    return make_pair(x, y); //I return (x,y) for plot reasons, only x is the randomly
}                           //generated value you're looking for.

int main()
{
    ofstream f;
    f.open("test.txt");
    for(int i=0; i<100000; i++)
    {
        //see below how I got -5 and 10 to be my interval (z1, z2)
        pair<double, double> p = random_on_a_gaus_distribution(-5.0, 10.0);
        f << p.first<<","<<p.second<<endl;
    }
    f.close();
    return 0;
}
Step 1
So first we define the general shape of a Gaussian PDF in a function called gaus. Simple.
Then we define a function random_on_a_gaus_distribution which uses a well-defined Gaussian function. In an experiment/measurement we would get the coefficients a, b, c by fitting our function. I picked some random ones (1, 2, 3) for this example; you can pick the ones that satisfy your HW assignment (that is: coefficients that make a Gaussian that has a mean of 7000).
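To connect this back to the numbers in the question, here is a minimal Python sketch of the same loop with the Gaussian centred on 7000 and the interval set to (1000, 10000). The width c = 500, the seed and the function name sample_gaus_in_range are assumptions made for illustration only, and Python's random module (a Mersenne Twister) stands in for the kiss generator.

```python
import math
import random

def gaus(a, b, c, x):
    """Unnormalized Gaussian with height a, centre b and width c."""
    return a * math.exp(-((x - b) ** 2) / (2.0 * c * c))

def sample_gaus_in_range(lo, hi, b, c, rng):
    """Rejection-sample one x in [lo, hi] under a Gaussian of height 1."""
    while True:
        x = rng.uniform(lo, hi)      # candidate in the interval (z1, z2)
        y = rng.uniform(0.0, 1.0)    # height in (0, max); max = a = 1 here
        if y <= gaus(1.0, b, c, x):  # point lies under the curve: accept x
            return x

rng = random.Random(12345)  # fixed seed only so that the run is repeatable
samples = [sample_gaus_in_range(1000.0, 10000.0, 7000.0, 500.0, rng)
           for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(min(samples), max(samples), sample_mean)
```

A narrow width keeps the curve well inside the interval. A much wider Gaussian would be cut off harder at 10000 than at 1000, which would pull the sample mean below 7000.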
Step 2 and 3
I used Wolfram Mathematica to plot gaus with parameters 1, 2, 3 to see what would be the most appropriate values for max and (z1, z2). You can see the graph yourself. The maximum of the function is 1.0 and, via the ancient method of science called eyeballin', I estimated that the cutoff points are -5.0 and 10.0.
To make random_on_a_gaus_distribution more general you could follow step 2) more rigorously: define eta and then evaluate your function at successive points until the PDF gets smaller than eta. The danger with this is that your cutoff points can be very far apart, and the search could take long for functions that fall off very slowly. Additionally, you have to find the maximum yourself. This is generally tricky. However, it can be recast as minimizing the negative of the function, for which standard routines exist. This can also be tricky in the general case but is not "undoable". The easiest way is to cheat a bit like I did and just hard-code this for a couple of functions only.
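The more rigorous version of step 2 could look roughly like the Python sketch below. It walks outward from a starting point on a fixed grid until the PDF drops below eta and remembers the largest value seen along the way. The name find_box, the grid step and eta = 0.05 are assumptions for illustration, and the sketch only makes sense if the starting point is near the peak and the function decays on both sides.

```python
import math

def gaus(a, b, c, x):
    """Unnormalized Gaussian with height a, centre b and width c."""
    return a * math.exp(-((x - b) ** 2) / (2.0 * c * c))

def find_box(pdf, peak_x, step, eta, max_steps=1_000_000):
    """Walk outward from peak_x in both directions until pdf(x) < eta.

    Returns (z1, z2, max_seen), where max_seen is only the largest value
    found on the grid, not a guaranteed global maximum.
    """
    max_seen = pdf(peak_x)
    cutoffs = []
    for direction in (-1.0, 1.0):
        for k in range(1, max_steps + 1):
            x = peak_x + direction * k * step
            value = pdf(x)
            max_seen = max(max_seen, value)
            if value < eta:
                cutoffs.append(x)
                break
        else:
            # Slowly decaying functions end up here: the search gave up.
            raise RuntimeError("pdf never dropped below eta")
    return cutoffs[0], cutoffs[1], max_seen

z1, z2, pdf_max = find_box(lambda x: gaus(1.0, 2.0, 3.0, x), 2.0, 0.01, 0.05)
print(z1, z2, pdf_max)
```

The max_steps guard is there because of exactly the danger mentioned above: for a function with very long tails the walk may take a long time or never finish.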
Step 4 and 5
And then you bash away. Just keep creating new points until you get a satisfactory hit. DO NOTICE that the returned number x is a random number. You wouldn't be able to find a logical link between two successively created x values, or between the first created x and the millionth.
However, the number of accepted x values in the interval around the x_max of our distribution is greater than the number of x values created in intervals for which PDF(x) < PDF(x_max).
This just means that your random numbers will be weighted within the chosen interval in such a manner that a larger PDF value at x corresponds to more random points accepted in a small interval around that value than around any other value xi for which PDF(xi) < PDF(x).
I returned both x and y to be able to plot the graph below; however, what you're looking to return is actually just the x. I did the plots with matplotlib.
It's probably better to show just a histogram of the randomly created variable on a distribution. This shows that the x values that are around the mean value of your PDF function are the most likely ones to get accepted, and therefore more randomly created variables with those approximate values will be created.
Additionally, I assume you would be interested in the implementation of the kiss random number generator. IT IS VERY IMPORTANT THAT YOU HAVE A VERY GOOD GENERATOR. I dare say that, to an extent, kiss probably doesn't cut it (the Mersenne Twister is often used).
Random.h
#pragma once
#include <stdlib.h>
const unsigned RNG_MAX=4294967295;
namespace kiss{
// unsigned int kiss_z, kiss_w, kiss_jsr, kiss_jcong;
unsigned int RanUns();
void RunGen();
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
namespace Crand{
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
Kiss.cpp
#include "Random.h"

unsigned int kiss_z = 123456789; //from 1 to a billion
unsigned int kiss_w = 378295763; //from 1 to a billion
unsigned int kiss_jsr = 294827495; //from 1 to RNG_MAX
unsigned int kiss_jcong = 495749385; //from 0 to RNG_MAX

//KISS99*
//Author: George Marsaglia
unsigned int kiss::RanUns()
{
    kiss_z=36969*(kiss_z&65535)+(kiss_z>>16);
    kiss_w=18000*(kiss_w&65535)+(kiss_w>>16);
    kiss_jsr^=(kiss_jsr<<13);
    kiss_jsr^=(kiss_jsr>>17);
    kiss_jsr^=(kiss_jsr<<5);
    kiss_jcong=69069*kiss_jcong+1234567;
    return (((kiss_z<<16)+kiss_w)^kiss_jcong)+kiss_jsr;
}

//discards the first 2000 values to warm up the generator
void kiss::RunGen()
{
    for (int i=0; i<2000; i++)
        kiss::RanUns();
}

//returns a whole number from 0 to upper_border-1
double kiss::Ran0(int upper_border)
{
    unsigned velicinaIntervala = RNG_MAX / upper_border; //interval size
    unsigned granicaIzbora= velicinaIntervala*upper_border; //selection bound
    unsigned slucajniBroj = kiss::RanUns(); //random number
    while(slucajniBroj>=granicaIzbora) //redraw to avoid modulo bias
        slucajniBroj = kiss::RanUns();
    return slucajniBroj/velicinaIntervala;
}

//returns a value in [bottom_border, upper_border) on a grid of 100000 steps
double kiss::Ran (double bottom_border, double upper_border)
{
    return bottom_border+(upper_border-bottom_border)*kiss::Ran0(100000)/(100001.0);
}
Additionally, there are the standard C random generators: CRands.cpp
#include "Random.h"

//standard pseudo-random generators from C
double Crand::Ran0(int upper_border)
{
    return rand()%upper_border;
}

double Crand::Ran (double bottom_border, double upper_border)
{
    //bottom_border has to be added, otherwise the result lies in [0, upper_border-bottom_border)
    return bottom_border+(upper_border-bottom_border)*rand()/((double)RAND_MAX+1);
}
It's also worth commenting on graph (b) above. When you have a very badly behaved PDF, PDF(x) will vary significantly between large values and very small ones.
The issue with that is that the box Ch(x) will match the extreme values of the PDF well, but since we create a random variable y for small values of PDF(x) as well, the chances of accepting that value are minute! It is far more likely that the generated y value will be larger than PDF(x) at that point. This means that you'll spend a lot of cycles creating numbers that won't get chosen and that nearly all your chosen random numbers will be very locally bound to the max of your PDF.
That's why it's often useful not to have the same Ch(x) box everywhere, but to define a parametrized set of intervals. However, this adds a fair bit of complexity to the code.
Where do you set your limits? How do you deal with borderline cases? When and how do you determine that you indeed need to use this approach? Calculating max might not be as simple now, depending on the method you originally envisioned for doing this.
Additionally, you now have to correct for the fact that a lot more numbers get accepted more easily in the areas where your Ch(x) box height is lower, which skews the original PDF.
This can be corrected by weighting numbers created in the lowered box by the ratio of the heights of the lower and higher boxes; basically you repeat the y step one more time. Create a random number z from 0 to 1 and compare it to the ratio lower_height/higher_height, which is guaranteed to be < 1. If z is smaller than the ratio, accept x; if it's larger, reject.
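For comparison, here is a Python sketch of a commonly used variant of the multi-box idea. Note that it is not the z-ratio correction described above: instead of drawing x uniformly over the whole interval, it first picks a box with probability proportional to the box's area, so no extra correction step is needed. The sharply peaked test function, the box boundaries and heights, and the name sample_piecewise are all assumptions chosen for illustration.

```python
import math
import random

def sample_piecewise(pdf, boxes, rng):
    """Rejection-sample one x under pdf using several boxes.

    boxes is a list of (left, right, height) tuples, where height must be
    at least as large as pdf everywhere on [left, right].
    """
    areas = [(right - left) * height for left, right, height in boxes]
    while True:
        # Pick a box with probability proportional to its area, so that
        # (x, y) ends up uniform over the union of all boxes.
        left, right, height = rng.choices(boxes, weights=areas, k=1)[0]
        x = rng.uniform(left, right)
        y = rng.uniform(0.0, height)
        if y <= pdf(x):
            return x

def peaked(x):
    """Sharply peaked, unnormalized test function with its maximum at 0."""
    return math.exp(-5.0 * abs(x))

# A tall box around the peak and low boxes over the tails.
# 0.083 is slightly above peaked(0.5) = exp(-2.5), about 0.0821.
boxes = [(-4.0, -0.5, 0.083), (-0.5, 0.5, 1.0), (0.5, 4.0, 0.083)]

rng = random.Random(7)  # fixed seed only so that the run is repeatable
samples = [sample_piecewise(peaked, boxes, rng) for _ in range(20000)]
central_fraction = sum(1 for s in samples if abs(s) < 0.5) / len(samples)
print(central_fraction)
```

With a single box of height 1 over (-4, 4), most candidates in the tails would be thrown away; the low boxes keep the acceptance rate in the tails reasonable.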
Generalizations of the code presented are also possible by writing a function that takes an object pointer instead. By defining your own class, e.g. function, which would describe functions in general, have an eval method to evaluate at a point, store your parameters, and calculate and store its own max/min values and zero/cutoff points, you wouldn't have to pass them in or define them inside a function like I did.
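One way such an object could look, sketched in Python rather than C++ to keep it short. The class name GaussianPdf, its attribute names and the function random_on_distribution are hypothetical; the cutoffs are passed in by hand here instead of being computed by the object.

```python
import math
import random

class GaussianPdf:
    """Bundles a Gaussian's parameters, maximum and cutoff points."""

    def __init__(self, a, b, c, z1, z2):
        self.a, self.b, self.c = a, b, c
        self.z1, self.z2 = z1, z2  # cutoff points, supplied by the caller
        self.max = a               # a Gaussian peaks at x = b with value a

    def eval(self, x):
        """Value of the function at the point x."""
        return self.a * math.exp(-((x - self.b) ** 2) / (2.0 * self.c ** 2))

def random_on_distribution(pdf, rng):
    """Rejection-sample one x from any object with eval, max, z1 and z2."""
    while True:
        x = rng.uniform(pdf.z1, pdf.z2)
        y = rng.uniform(0.0, pdf.max)
        if y <= pdf.eval(x):
            return x

rng = random.Random(99)  # fixed seed only so that the run is repeatable
pdf = GaussianPdf(1.0, 2.0, 3.0, -5.0, 10.0)
samples = [random_on_distribution(pdf, rng) for _ in range(20000)]
print(sum(samples) / len(samples))
```

The sampler no longer knows anything about Gaussians, so any other class exposing the same four members could be dropped in.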
Good Luck have fun!
Answer 2:
tl;dr: Raise a uniform 0 to 1 distribution to the power (1 - m) / m, where m is the desired mean (between 0 and 1). Shift/scale as desired.
I was curious about how to implement this. I figured a trapezoid would be the easiest method, but then you're limited in that the most extreme mean you can get is with a triangle, which isn't that extreme. The math started getting hard, so I reverted to a purely empirical method that seems to work pretty well.
Anyways, for a distribution, how about starting with the uniform [0, 1) distribution and raising the values to some arbitrary power? Square them and the values get pulled toward 0, so the mean drops. Take the square root and they get pushed toward 1, so the mean rises. You can go to whatever extreme you want and shove the distribution as hard as you want.
import random

def randompow(p):
    return random.random() ** p
(Everything's written in Python, but should be easy enough to translate. If something's unclear, just ask. random.random() returns floats from 0 to 1)
So, how do we adjust that power? Well, how's the mean seem to shift with varying powers?
Looks like some sort of sigmoid curve. There are lots of sigmoid functions, but hyperbolic tangent seems to work pretty well.
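The curve can be reproduced with a few lines. This is only a sketch: the helper name empirical_mean, the grid of 13 powers from 10**-3 to 10**3 and the sample size of 20000 per power are assumptions, not necessarily what was used for the original plot.

```python
import random

def empirical_mean(p, n, rng):
    """Average of n uniform [0, 1) draws, each raised to the power p."""
    return sum(rng.random() ** p for _ in range(n)) / n

rng = random.Random(0)  # fixed seed only so that the run is repeatable

# log10 of the powers, from -3.0 to 3.0 in steps of 0.5
log_powers = [i / 2.0 for i in range(-6, 7)]
means = [empirical_mean(10.0 ** lp, 20000, rng) for lp in log_powers]

for lp, mean in zip(log_powers, means):
    print('log10(p) = {:+.1f}, mean = {:.4f}'.format(lp, mean))
```

Small powers push the mean toward 1, large powers push it toward 0, and p = 1 (log10(p) = 0) leaves it near 0.5.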
Not 100% there, let's try to scale it in the X direction...
import numpy as np
import scipy.optimize

# x are the values from -3 to 3 (log transformed from the powers used)
# y are the empirically-determined means given all those powers
def fitter(tanscale):
    xsc = tanscale * x
    sigtan = np.tanh(xsc)
    sigtan = (1 - sigtan) / 2
    resid = sigtan - y
    return sum(resid**2)

fit = scipy.optimize.minimize(fitter, 1)
The fitter says the best scaling factor is 1.1514088816214016. The residuals are actually pretty low, so sounds good.
Implementing the inverse of all the math I didn't talk about looks like:
def distpow(mean):
    p = 1 - (mean * 2)
    p = np.arctanh(p) / 1.1514088816214016
    return 10**p
That gives us the power to use in the first function to get whatever mean we want for the distribution. A factory function can return a function that churns out a bunch of numbers from the distribution with the desired mean:
def randommean(mean):
    p = distpow(mean)
    def f():
        return random.random() ** p
    return f
How's it do? Reasonably well out to 3-4 decimals:
for x in [0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.99]:
    f = randommean(x)
    # sample the distribution 10 million times
    mean = np.mean([f() for _ in range(10000000)])
    print('Target mean: {:0.6f}, actual: {:0.6f}'.format(x, mean))
Target mean: 0.010000, actual: 0.010030
Target mean: 0.100000, actual: 0.100122
Target mean: 0.200000, actual: 0.199990
Target mean: 0.400000, actual: 0.400051
Target mean: 0.500000, actual: 0.499905
Target mean: 0.600000, actual: 0.599997
Target mean: 0.800000, actual: 0.799999
Target mean: 0.900000, actual: 0.899972
Target mean: 0.990000, actual: 0.989996
A more succinct function that just gives you a value given a mean (not a factory function):
def randommean(m):
    p = np.arctanh(1 - (2 * m)) / 1.1514088816214016
    return random.random() ** (10 ** p)
Edit: fitting against the natural log of the power instead of log10 gave a scaling factor suspiciously close to 0.5. Doing some math to simplify out the arctanh gives:
def randommean(m):
    '''Return a value from the distribution 0 to 1 with average *m*'''
    return random.random() ** ((1 - m) / m)
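The closed form can also be checked directly, without any fitting. A short derivation, assuming U is uniform on (0, 1) and p > -1; the last line shows why the fitted factor 1.1514 sits so close to ln(10)/2, which is about 1.1513:

```latex
\begin{align*}
\mathbb{E}\left[U^{p}\right] &= \int_{0}^{1} u^{p}\,du = \frac{1}{p+1}, \\
\frac{1}{p+1} = m \quad &\Longrightarrow \quad p = \frac{1-m}{m}, \\
10^{\operatorname{arctanh}(1-2m)/(\ln(10)/2)} &= e^{2\operatorname{arctanh}(1-2m)}
  = \frac{1+(1-2m)}{1-(1-2m)} = \frac{1-m}{m}.
\end{align*}
```

So the power (1 - m) / m gives a mean of exactly m, and the small mismatches in the table above are sampling noise plus the slightly off fitted constant.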
From here it should be fairly easy to shift, rescale, and round off the distribution. The truncating-to-integer might end up shifting the mean by 1 (or half a unit?), so that's an unsolved problem (if it matters).
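For the numbers in the question, the shift and rescale could be sketched as follows. The function name random_with_mean and the seed are assumptions for illustration; the values are left as floats, so the integer rounding question is not addressed here.

```python
import random

def random_with_mean(lo, hi, mean, rng):
    """Return a float in [lo, hi] from a power-law shaped distribution
    whose expected value is mean (requires lo < mean < hi)."""
    m = (mean - lo) / (hi - lo)  # target mean rescaled into (0, 1)
    p = (1.0 - m) / m            # power that gives the unit distribution mean m
    return lo + (hi - lo) * rng.random() ** p

rng = random.Random(2024)  # fixed seed only so that the run is repeatable
values = [random_with_mean(1000, 10000, 7000, rng) for _ in range(200000)]
print(min(values), max(values), sum(values) / len(values))
```

Here m works out to 2/3 and the power to 0.5, so each value is 1000 plus 9000 times the square root of a uniform draw.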