Proposal: locally unique GUID alternative

Posted 2019-02-26 20:30

Question:

The problem

I'm looking for feedback on this exploration of a locally unique alternative for GUIDs, with the following requirements:

  1. Has very low chance of collisions (to the point that we'd rather collide once a year than perform checks)
  2. Does not leak sensitive information, such as how many items exist
  3. Has high performance in an SQL database
  4. Is copy/pastable for manual querying (as both query string and query result)
  5. Is usable as a URI component without encoding

To meet the requirements, I have decided on the form of a 64-bit unsigned integer. It is easy on the CPU, nice and small for primary key use, semi-human-readable, digits-only, and easy to copy/paste when querying by hand. (As a counter-example, BLOBs severely hinder manual querying on most SQL databases.)

Additionally, Percona demonstrates that monotonically increasing values perform much better as primary keys, particularly on insertion speed, so this is a trait to aim for.

Proposed structure

From left to right, most significant bits first (a sketch of the bit packing follows the list):

  1. 46 bits. Timestamp. Unix time in milliseconds. (At least in C#, sub-millisecond time is not readily available.) This will last until somewhere in the year 4199. It gives us monotonically increasing values.
  2. 8 bits. Part of local IP. The last component of the machine's internal IP address, taken from the fastest available network interface. This should be an Ethernet LAN interface for most servers.
  3. 10 bits. Uniquefier. A static counter that is incremented (interlocked) on use, with wrap-around.
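
To make the packing concrete, here is a minimal C# sketch. The class and member names (LocalUniqueId, Next, GetLastIpOctet) are mine, not part of the proposal, and the "fastest available network interface" selection is simplified to picking the operational, non-loopback interface with the highest reported speed.

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;
using System.Net.Sockets;
using System.Threading;

// Sketch of the proposed layout: 46-bit millisecond timestamp,
// 8-bit last octet of the local IPv4 address, 10-bit wrapping counter.
public static class LocalUniqueId
{
    private static int _counter;                              // wraps via the 10-bit mask below
    private static readonly ulong IpComponent = GetLastIpOctet();

    public static ulong Next()
    {
        // 46 bits: Unix time in milliseconds (sufficient until late 4199).
        ulong timestamp = (ulong)DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() & ((1UL << 46) - 1);

        // 10 bits: interlocked counter with wrap-around.
        ulong uniquefier = (ulong)(Interlocked.Increment(ref _counter) & 0x3FF);

        // Timestamp occupies the top 46 bits, the IP octet the next 8, the uniquefier the bottom 10.
        return (timestamp << 18) | (IpComponent << 10) | uniquefier;
    }

    private static ulong GetLastIpOctet()
    {
        // Simplified "fastest interface" selection: the operational, non-loopback
        // interface with the highest reported speed that has an IPv4 address.
        var address = NetworkInterface.GetAllNetworkInterfaces()
            .Where(n => n.OperationalStatus == OperationalStatus.Up
                     && n.NetworkInterfaceType != NetworkInterfaceType.Loopback)
            .OrderByDescending(n => n.Speed)
            .SelectMany(n => n.GetIPProperties().UnicastAddresses)
            .Select(u => u.Address)
            .FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork);

        if (address == null)
            throw new InvalidOperationException("No IPv4 address available.");

        return address.GetAddressBytes()[3];                  // last component of the IPv4 address
    }
}
```

Callers would then simply use `ulong id = LocalUniqueId.Next();`. The shift amounts (18 = 8 + 10, then 10) are what line the three fields up as described in the list above.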

Collisions

There is a 1/1024 (~0.1%) collision chance whenever:

  1. Two systems share the same last IP address component and make a call in the same millisecond. This can be completely avoided.
  2. A system's clock is turned back and it makes a call in the same millisecond as a call made before the time change. This should be a highly infrequent situation, and seems to fall within the requirements.

Limitations

Interestingly, we seem to be meeting the requirements (#2 being a dodgy one). Let's take a look at some of the limitations.

  1. Servers' local IP addresses must be carefully maintained - even amongst different data centers, if applicable.
  2. We cannot support more than 256 servers (8 bits) - in practice fewer, since some last-octet values (such as .0 and .255 in a typical /24) are unusable and other constraints on the IPs may exist.
  3. We leak information about which identifiers were created by the same server. I believe this is the case with many GUID implementations also, however.
  4. Information can be obtained about traffic volumes, by checking counter increments between a user's own requests. The effectiveness is diminished by the fact that the counter is used for various kinds of data, increasing rapidly and in a way that is hard to ascribe to any particular type of data.
  5. Identifiers are much more guessable than ones that contain plenty of randomness. A brute-force attack would need about 512 guesses of the uniquefier per attempted millisecond. Ideally, this attack yields nothing, i.e. the system reports "unauthorized" regardless of whether the identifier is non-existent or merely does not belong to the user (a sketch of that idea follows this list), and is resistant to timing attacks. Realistically, let's assume a dedicated attacker will find a leak.
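
A minimal sketch of the "unauthorized either way" idea, assuming an ASP.NET Core controller; the Document type, IDocumentRepository, and the callerId query parameter are hypothetical stand-ins (in a real system the caller's identity would come from authentication, and genuine timing-attack resistance needs more care than this).

```csharp
using Microsoft.AspNetCore.Mvc;

// Hypothetical entity and repository, for illustration only.
public record Document(ulong Id, ulong OwnerId, string Content);

public interface IDocumentRepository
{
    Document? Find(ulong id);   // returns null when the identifier does not exist
}

[ApiController]
[Route("documents")]
public class DocumentsController : ControllerBase
{
    private readonly IDocumentRepository _repository;

    public DocumentsController(IDocumentRepository repository) => _repository = repository;

    [HttpGet("{id}")]
    public IActionResult Get(ulong id, [FromQuery] ulong callerId)
    {
        var doc = _repository.Find(id);

        // Same response whether the identifier is unknown or belongs to someone else,
        // so guessing identifiers reveals nothing from the status code alone.
        if (doc == null || doc.OwnerId != callerId)
            return Unauthorized();

        return Ok(doc);
    }
}
```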

Considerations

  1. Limitations #1 and #2 must simply fit the company.

  2. Limitation #3 appears to be deemed acceptable in existing GUID implementations, and is one I am willing to live with.

  3. Limitation #4 is a tricky one. How sensitive is this information? "So we do 10K inserts per minute, into an unknown number of tables." Relative volumes do grant more insight: "Between 08:00-09:00 there is twice as much activity as an hour earlier." Still, that will usually be common knowledge in a given field. Unexpected peaks might leak a bit more: "So the system works hard at 03:00 in the morning." How bad is all this? Judging by the number of companies that expose auto-increment identifiers, we might say it's an improvement more often than not... but it could be a deal-breaker.

  4. We could use (crypto)random bits as the uniquefier to deal with limitation #4, but that would introduce a third collision opportunity: whenever the system generates multiple identifiers within the same millisecond. The birthday paradox makes that particularly problematic; see the estimate after this list.

  5. We could free up 2 bits if we allowed the timestamp to wrap around as early as 2527. Selfish and insensitive to future generations, or arrogant to assume our code will still be in use by then? :-)
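
To put numbers on consideration #4 (my own estimate, not from the proposal): with a 10-bit random uniquefier there are only 1,024 possible values per millisecond, so the chance of a within-millisecond collision grows quickly with the number of identifiers generated in that millisecond.

```csharp
using System;

// Birthday-style estimate: probability that k identifiers generated in the
// same millisecond collide when the 10-bit uniquefier is drawn at random.
public static class BirthdayEstimate
{
    public static double CollisionProbability(int idsPerMillisecond, int space = 1024)
    {
        double allDistinct = 1.0;
        for (int i = 0; i < idsPerMillisecond; i++)
            allDistinct *= (space - i) / (double)space;
        return 1.0 - allDistinct;
    }

    public static void Main()
    {
        foreach (int k in new[] { 2, 5, 10, 38 })
            Console.WriteLine($"{k} ids/ms -> ~{100 * CollisionProbability(k):F1}% collision chance");
        // 2 ids/ms  -> ~0.1%
        // 5 ids/ms  -> ~1.0%
        // 10 ids/ms -> ~4.3%
        // 38 ids/ms -> ~50.1%
    }
}
```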

What else?

I welcome your feedback, improvements, ideas, limitations that I have missed! How would you solve this problem?

Answer 1:

At the risk of being that guy who responds with "why would you want to do that?", I am wondering what underlying business problem prevents you from using a GUID.

BIGINT, GUID, and hash tables

I use a BIGINT for the primary key, which keeps everything sequential, compact and fast. This is for all internal work, i.e. within my stored procedures, in SQL joins, etc. Then I have a hash table with GUIDs, which becomes the starting point for external callers.

Since I use table inheritance, the BIGINT IDs can serve as the sequential primary key in my hash table, as all IDs are unique across the entire database (yet still sequential). To take it further, I create a composite key on the hash table that includes the last several digits of the GUID, and partition the hash table on those values, so that each partition is stored separately on disk and remains sequential, while giving me a natural way to index on the GUID I'm looking up.

When I originally started doing it this way, I posted a how-to (excluding the partitioning part) here:

What is the fastest way to look for duplicate uniqueidentifier in Sql Server?

Initial performance tests were lightning fast against 100,000,000 records.

Not the answer to your question, but perhaps worth 2 cents to somebody.