Ok this is one of those trickier than it sounds questions so I'm turning to stack overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple a list of numbers from 0 to 1,000,000,000 in random order to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever so it can't rely on internal state.
No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:
1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).
2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).
3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated random number and check again, problem is we're back to simply generating random numbers and checking as in solution 1.
4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).
5) Generate much larger random numbers and hash them (i.e. using MD5) to get a smaller numeric value, we should rarely get collisions, but I end up with larger than needed numbers again.
6) Prepend or append time based information to the random number (i.e. unix timestamp) to reduce chances of a collision, again I get larger numbers than I need.
Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion (or a thousand million for your europeans =)).
Answer and why I accepted it:
So I will simply go with 1, and hope it's not an issue, however if it is I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarentee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).
I think you are overestimating the problems with approach 1). Unless you have hard-realtime requirements just checking by random choice terminates rather fast. The probability of needing more than a number of iterations decays exponentially. With 100M numbers outputted (10% fillfactor) you'll have one in billion chance of requiring more than 9 iterations. Even with 50% of numbers taken you'll on average need 2 iterations and have one in a billion chance of requiring more than 30 checks. Or even the extreme case where 99% of the numbers are already taken might still be reasonable - you'll average a 100 iterations and have 1 in a billion change of requiring 2062 iterations
I'd rethink the problem itself... You don't seem to be doing anything sequential with the numbers... and you've got an index on the column which has them. Do they actually need to be numbers?
Consider a sha hash... you don't actually need the entire thing. Do what git or other url shortening services do, and take first 3/4/5 characters of the hash. Given that each character now has 36 possible values instead of 10, you have 2,176,782,336 combinations instead of 999,999 combinations (for six digits). Combine that with a quick check on whether the combination exists (a pure index query) and a seed like a timestamp + random number and it should do for almost any situation.
You are stating that you store the numbers in a database.
Wouldn't it then be easier to store all the numbers there, and ask the database for a random unused number? Most databases support such a request.
Examples
MySQL:
PostgreSQL:
You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.
Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 10, width 9 for the parameters of the question), with an algorithm which is still cryptographically robust.
It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.
AES-FFX is one proposed standard method to achieve this.
I've experimented with some basic Python code for AES-FFX--see Python code here (but note that it doesn't fully comply with the AES-FFX specification). It can e.g. encrypt a counter to a random-looking 7-digit decimal number. E.g.:
For another example in Python, using another non-AES-FFX (I think) method, see this blog post "How to Generate an Account Number" which does FPE using a Feistel cipher. It generates numbers from 0 to 2^32-1.
With some modular arithmic and prime numbers, you can create all numbers between 0 and a big prime, out of order.
If you choose your numbers carefully, the next number is hard to guess.Do you need this to be cryptographically secure or just hard to guess? How bad are collisions? Because if it needs to be cryptographically strong and have zero collisions, it is, sadly, impossible.