What is the probability of collision with a 6 digi

2019-03-18 09:15发布

问题:

I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'

回答1:

Because of the Birthday Paradox it's more likely than you might think.

There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).

I ran a few tests and this is the number of codes I could generate before the first collision occurred:

  • 73366
  • 59307
  • 79297
  • 36909

Collisions will become more frequent as the number of codes increases.

Here was my test code (written in Python):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909


回答2:

Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately

1 - ((d-1)/d)**(n*(n-1)/2)

For any n over 50,000 or so, that's pretty high.

Looks like a 10-character code has a collision probability of only about 1/800. So go with 10 or more.



回答3:

Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:

http://wolfr.am/niaHIF

Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.



回答4:

As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trails. Then, we can show that p(n; d) is approximately equal to:

1 - ((d-1)/d)^(n*(n-1)/2)

We can easily plot this in R:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

which gives

As you can see the collision probability increases very quickly with the number of trials/rows



回答5:

While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique. See Fisher-Yates shuffle on Wikipedia.



回答6:

A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:

The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.

The module provides a drop in replacement for rand which is handy.

You can generate the sequence of keys with the following code:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.