Say I have a hash algorithm, and it's nice and smooth (The odds of any one hash value coming up are the same as any other value).
Now say that I know that the odds of picking 2 hashes and there being a collision are (For arguments sake) 50000:1.
Now say I pick 100 hashes. How do I calculate the odds of a collision within that set of 100 values, given the odds of a collision in a set of 2?
What is the general solution to this, so that I can come up with a number of hash attempts after which the odds fall below some acceptable threshold? E.g. I can say things like "A batch of 49999 hash value creations has a high chance of collision".
First calculate the probability that there is not a collision:
That sounds a lot like the Birthday Paradox to me.
You should be able to just substitute the set of possible birthdays (365) with the possible hashes (50000) and run the same calculations they present there.
If you modify the python script presented in the article for your values:
You end up with the odds of collision on two numbers of 0.00002. Around 265 samples, you have around a 50% chance of having a collision.
and in JS
This is called the Birthday problem. To solve it, think instead about the odds of there being no collisions (call it pnc).
This is a generalization of the Birthday problem.