The problem
I'm looking for feedback on this exploration of a locally unique alternative for GUIDs, with the following requirements:
- Has very low chance of collisions (to the point that we'd rather collide once a year than perform checks)
- Does not leak sensitive information, such as how many items exist
- Has high performance in an SQL database
- Is copy/pastable for manual querying (as both query string and query result)
- Is usable as a URI component without encoding
To meet the requirements, I have decided on the form of a 64-bit unsigned integer. It is easy on the CPU, nice and small for primary key use, semi-human-readable, digits-only, and easy to copy/paste when querying by hand. (As a counter-example, BLOBs severely hinder manual querying on most SQL databases.)
Additionally, Percona demonstrates that monotonically increasing values perform much better as primary keys, particularly on insertion speed, so this is a trait to aim for.
Proposed structure
From left to right, most significant bits being on the left
- 46 bits. Timestamp. Unix time in milliseconds. (At least in C#, sub-millisecond time is not readily available.) This will last until somewhere in the year 4199. It gives us monotonically increasing values.
- 8 bits. Part of local IP. The last component of the machine's internal IP address, of the fastest available network interface. Should be ethernet LAN for most servers.
- 10 bits. Uniquefier. A static counter that is incremented (interlocked) on use, with wrap-around.
Collisions
There is a 1/1024 (~0.1%) collision chance whenever:
- Two systems share the same last IP address component and make a call in the same millisecond. This can be completely avoided.
- A system's clock is turned back and it makes a call on the same millisecond of a call from before the time change. This should be a highly infrequent situation that seems within the requirements.
Limitations
Interestingly, we seem to be meeting the requirements (#2 being a dodgy one). Let's take a look at some of the limitations.
- Servers' local IP addresses must be carefully maintained - even amongst different data centers, if applicable.
- We cannot support more than 255 servers - possibly fewer, if other constraints on the IPs exist.
- We leak information about which identifiers were created by the same server. I believe this is the case with many GUID implementations also, however.
- Information can be obtained about traffic volumes, by checking counter increments between a user's own requests. The effectiveness is diminished by the fact that the counter is used for various kinds of data, increasing rapidly and in a way that is hard to ascribe to any particular type of data.
- Identifiers are much more guessable than ones that have plenty of randomness. A brute-force attack would need about 512 calls (uniquefier) per attempted millisecond. Ideally, this attack yields nothing, i.e. the system reports "unauthorized" regardless of whether the identifier is non-existent or does not belong to the user, and is resistant to timing attacks. Realistically, let's assume a dedicated attacker will find a leak.
Considerations
Limitations #1 and #2 must simply fit the company.
Limitation #3 appears to be deemed acceptable in existing GUID implementations, and is one I am willing to live with.
Limitation #4 is a tricky one. How sensitive is this information? "So we do a 10K inserts per minute, into an unknown number of tables." Relative volumes do grant more insight: "Between 08:00-09:00 there is twice as much activity as an hour earlier." Still, this will usually be common knowledge in a given field. Unexpected peaks might leak some more info. "So the system works hard at 03:00 in the morning." How bad is all this? Judging by the number of companies that expose auto-increment identifiers, we might say it's an improvement more often than not... But it could be a deal-breaker.
We could use (crypto)random bits as the uniquefier to deal with limitation #4, but that would introduce a third collision opportunity: whenever the system generates multiple identifiers within a millisecond. The birthday paradox is particularly problematic there.
We could free up 2 bits if we allow the timestamp to wrap around in 2527 already. Selfish and insensitive to future generations, or arrogant to assume that our code would be used longer? :-)
What else?
I welcome your feedback, improvements, ideas, limitations that I have missed! How would you solve this problem?