I have a combinatorics problem for which I want to be able to pick an integer at random between 0 and a big integer.
Inadequacies of my current approach
For regular integers I would usually write something like int rand 500; and be done with it.
But for big integers, it looks like rand isn't meant for this.
Using the following code, I ran a simulation of 2 million calls to rand $bigint:
$ perl -Mbigint -E 'say int rand 1230138339199329632554990773929330319360000000 for 1 .. 2e6' > rand.txt
The distribution of the resultant set is far from desirable:
- 0 (56 counts)
- magnitude 1e+040 (112 counts)
- magnitude 1e+041 (1411 counts)
- magnitude 1e+042 (14496 counts)
- magnitude 1e+043 (146324 counts)
- magnitude 1e+044 (1463824 counts)
- magnitude 1e+045 (373777 counts)
So the process was never able to choose a number like 999 or 5e+020, which makes this approach unsuitable for what I want to do.
It looks like this has something to do with the limited precision of rand, which never goes beyond 15 digits in the course of my testing:
$ perl -E 'printf "%.66g", rand'
0.307037353515625
How can I overcome this limitation?
My initial thought is that maybe there is a way to influence the precision of rand, but it feels like a band-aid to a much bigger problem (i.e. the inability of rand to handle big integers).
In any case, I'm hoping someone has walked down this path before and knows how to remedy the situation.
(Converted from my comment)
A more theory-driven approach is to use multiple calls to the PRNG to create enough random bits for the number you want to sample. Care has to be taken if the number of bits produced by the PRNG does not match the number of bits needed, as outlined below!
Pseudocode
- Calculate the bits needed to represent your number: n_needed_bits
- Check the size of bits returned by your PRNG: n_bits_prng
- Calculate the number of samples needed: needed_prng_samples = ceil(n_needed_bits / n_bits_prng)
- While true:
  - Sample needed_prng_samples (calls to PRNG) times & concatenate all the bits obtained
  - Check if the resulting number is within your range
    - Yes?: return number (finished)
    - No?: do nothing (loop continues; will resample all components again!)
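The steps above can be sketched in Perl roughly as follows. This is a minimal sketch, not a definitive implementation: the function name rand_bigint_below and the constant $BITS_PER_CALL are made up for illustration, and the value 15 is an assumption taken from the precision observed in the question. One deviation from the steps above, labelled here as an assumption: the last call only contributes as many bits as are still missing, so a candidate never has more than n_needed_bits bits and at least half of all candidates are accepted.

```perl
use strict;
use warnings;
use Math::BigInt;

# Assumption: only this many bits of each rand() call are trusted.
# 15 matches the precision seen in the question; adjust for your perl build.
my $BITS_PER_CALL = 15;

# Returns a Math::BigInt drawn from 0 .. $limit - 1 by rejection sampling.
# (Hypothetical helper name, not part of any module.)
sub rand_bigint_below {
    my ($limit) = @_;
    $limit = Math::BigInt->new($limit);
    die "limit must be a positive integer" if $limit->is_nan || $limit <= 0;

    # Bits needed to represent the largest allowed value, $limit - 1.
    # as_bin() returns a string with a "0b" prefix, hence the - 2.
    my $n_needed_bits = length(($limit - 1)->as_bin) - 2;

    while (1) {
        my $candidate = Math::BigInt->bzero;
        my $remaining = $n_needed_bits;
        while ($remaining > 0) {
            # Take a full chunk, or fewer bits if fewer are still missing.
            my $take = $remaining < $BITS_PER_CALL ? $remaining : $BITS_PER_CALL;
            $candidate->blsft($take);                  # make room for the chunk
            $candidate->badd(int rand(2 ** $take));    # append $take random bits
            $remaining -= $take;
        }
        return $candidate if $candidate < $limit;
        # Out of range: discard the whole candidate and resample every chunk.
    }
}

print rand_bigint_below("1230138339199329632554990773929330319360000000"), "\n";
```

The result is only as uniform as the bits that rand delivers; for anything security-related a different random source would be needed.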
Remarks
- This is a form of acceptance sampling / rejection sampling
- The approach is a Las Vegas type of algorithm: the runtime is not bounded in theory
- The number of loops needed is on average: n_possible-sample-numbers-of-full-concatenation / n_possible-sample-numbers-within-range
- The complete resampling (if the result is not within range), as required by the rejection method, is what makes a formal analysis of non-bias / uniformity possible, and is a very important aspect of this approach
- Of course the classic assumptions regarding PRNG output are needed to make this work.
- If the PRNG, for example, has some non-uniformity in its low bits / high bits (as often mentioned), this will have an effect on the output above
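To make the remark about the average number of loops concrete, the ratio can be computed directly. The sketch below uses a hypothetical helper, expected_loops, and assumes the full concatenation of ceil(n_needed_bits / n_bits_prng) chunks is used without trimming. For the 46-digit number from the question it gives about 1.16 loops with a 15-bit PRNG, but over a thousand with a 32-bit PRNG, which is the kind of mismatch the warning above is about.

```perl
use strict;
use warnings;
use Math::BigInt;
use Math::BigFloat;

# Average number of loop iterations: size of the full concatenated range
# divided by the size of the target range.
# (Hypothetical helper name; $bits_per_call corresponds to n_bits_prng.)
sub expected_loops {
    my ($limit, $bits_per_call) = @_;
    $limit = Math::BigInt->new($limit);

    # n_needed_bits: bits of the largest allowed value ("0b" prefix removed)
    my $needed_bits = length(($limit - 1)->as_bin) - 2;

    # needed_prng_samples = ceil(n_needed_bits / n_bits_prng)
    my $samples = int(($needed_bits + $bits_per_call - 1) / $bits_per_call);

    # Number of distinct values the full concatenation can take
    my $total = Math::BigInt->new(2)->bpow($samples * $bits_per_call);

    return (Math::BigFloat->new($total) / Math::BigFloat->new($limit))->numify;
}

my $n = "1230138339199329632554990773929330319360000000";
printf "15-bit PRNG: %.3f loops on average\n", expected_loops($n, 15);
printf "32-bit PRNG: %.1f loops on average\n", expected_loops($n, 32);
```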
I was looking at this problem from the wrong angle
The bins are not the same size. Each bin is 10 times the size of the previous one. To put this in perspective, there are 10,000 possible integers at magnitude 1e+44 for every integer with magnitude 1e+40.
The probability of finding any number with magnitude 1e+20 for the bigint at 1e+45 is less than 0.00000 00000 00000 00000 001 %.
Forget needles in haystacks, this is more like finding a needle in a quasar!
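The bin sizes can be checked with a few lines of Perl. The sketch below is only an illustration: the helper name expected_count is made up, and it assumes every integer below the bound is equally likely. For 45-digit numbers (magnitude 1e+44) it comes out at roughly 1.46 million of 2 million draws, in line with the 1463824 counted in the question.

```perl
use strict;
use warnings;
use Math::BigInt;
use Math::BigFloat;

my $limit = Math::BigInt->new("1230138339199329632554990773929330319360000000");
my $draws = 2_000_000;

# Expected number of draws having exactly $digits decimal digits, assuming
# each integer in 0 .. $limit - 1 is equally likely.
# (Hypothetical helper; the value 0 itself is ignored for simplicity.)
sub expected_count {
    my ($digits) = @_;
    my $lo = Math::BigInt->new(10)->bpow($digits - 1);   # smallest such number
    my $hi = Math::BigInt->new(10)->bpow($digits);       # one past the largest
    $hi = $limit->copy if $hi > $limit;                  # clip the top bin
    return 0 if $lo >= $hi;                              # bin lies above the bound
    my $share = Math::BigFloat->new($hi - $lo) / Math::BigFloat->new($limit);
    return ($share * $draws)->numify;
}

for my $digits (21, 41 .. 46) {
    printf "%2d digits: about %.3g expected in %d draws\n",
        $digits, expected_count($digits), $draws;
}
```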
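One approach is to cut the string representation of the number into chunks and draw each chunk separately. A boolean ($low) stays false while the chunks drawn so far are equal to the corresponding chunks of the upper bound; once a chunk comes out lower, it becomes true and the remaining chunks can take any value.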
EDIT: added some explanations following comment
# Draws one chunk.
# First argument (in): upper bound for this chunk (string of digits).
# Second argument (in/out): "lower" flag; false while every chunk drawn so far
# equals the upper bound, true from the first lower chunk onwards.
sub randhlp {
    my ($upp) = @_;
    my $l = length $upp;
    # random number in
    # - 0 .. $upp while the flag is false
    # - 0 .. 10**$l - 1 (i.e. up to 9..99) otherwise
    my $x = int rand($_[1] ? 10**$l : $upp + 1);
    if ($x < $upp) {
        $_[1] = 1;    # $_[1] aliases the caller's variable
    }
    # left padding with 0 to the chunk width
    return sprintf("%0*d", $l, $x);
}

# returns a random number less than the argument (numeric string)
sub randistr {
    my ($n) = @_;
    $n =~ /^\d+$/ or die "invalid input not numeric";
    $n ne "0" or die "invalid input 0";
    my ($low, $x);
    do {
        undef $x;
        # split the string into chunks of 6 characters,
        # except the last chunk, which has 1 to 6 characters
        while ($n =~ /.{1,6}/g) {
            # concatenate random results
            $x .= randhlp($&, $low);
        }
    } while ($x eq $n);
    # strip leading zeros, but keep a single "0" if the result is zero
    $x =~ s/^0+(?=\d)//;
    return $x;
}
The test
my %H;
for (1 .. 2e6) {
    $H{length(randistr("1230138339199329632554990773929330319360000000"))}++;
}
print "$_ $H{$_}\n" for sort keys %H;
Returns
39 4
40 61
41 153
42 1376
43 14592
44 146109
45 1463301
46 374404