Which part of a GUID is most worth keeping?

Posted 2019-01-23 16:34

Question:

I need to generate a unique ID and was considering Guid.NewGuid to do this, which generates something of the form:

0fe66778-c4a8-4f93-9bda-366224df6f11

This is a little long for the string-type database column that it will end up residing in, so I was planning on truncating it.

The question is: is one part of a GUID preferable to the rest in terms of uniqueness? Should I be lopping off the start, the end, or removing parts from the middle? Or does it just not matter?

Answer 1:

You can save space by using a base64 string instead:

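var g = Guid.NewGuid();
// Encode the 16 raw bytes as base64 instead of 32 hex digits.
var s = Convert.ToBase64String(g.ToByteArray());

Console.WriteLine(g);  // 36 characters, with hyphens
Console.WriteLine(s);  // 24 characters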

This saves 12 characters, 24 instead of 36 (8 if you weren't using the hyphens).
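
If the column is still too tight, note that a 16-byte value always encodes to base64 text ending in two '=' padding characters, which carry no information and can be dropped to get 22 characters. A minimal sketch, assuming URL-safe output is wanted; the helper names ToShortString and FromShortString are made up for illustration and are not part of .NET:

```csharp
using System;

// Hypothetical helpers (not part of .NET): 22-character, URL-safe text form of a Guid.
static string ToShortString(Guid g)
{
    // 16 bytes always encode to 24 base64 characters ending in "==".
    string b64 = Convert.ToBase64String(g.ToByteArray());
    return b64.Substring(0, 22).Replace('+', '-').Replace('/', '_');
}

static Guid FromShortString(string s)
{
    // Undo the character swap and restore the padding before decoding.
    string b64 = s.Replace('-', '+').Replace('_', '/') + "==";
    return new Guid(Convert.FromBase64String(b64));
}

var g = Guid.NewGuid();
string shortForm = ToShortString(g);
Console.WriteLine(shortForm);                        // 22 characters
Console.WriteLine(FromShortString(shortForm) == g);  // True
```

All 128 bits are preserved, so this shortens the text without giving up any uniqueness.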



Answer 2:

Keep all of it.

From the linked article:

* four bits to encode the computer number,
* 56 bits for the timestamp, and
* four bits as a uniquifier.

In other words, instead of truncating, you can redefine the ID's layout to right-size it to your needs; the example above totals 64 bits.
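
As a rough illustration of that 4 + 56 + 4 bit layout, the three fields could be packed into a single ulong. This is only a sketch: the function name PackId, the field order, and the choice of timestamp unit are all assumptions, and the caller would still be responsible for assigning machine numbers and bumping the uniquifier when two IDs share a timestamp.

```csharp
using System;

// Hypothetical layout: [4-bit machine][56-bit timestamp][4-bit uniquifier].
static ulong PackId(int machine, long timestamp, int uniquifier)
{
    if (machine < 0 || machine > 0xF)
        throw new ArgumentOutOfRangeException(nameof(machine));
    if (uniquifier < 0 || uniquifier > 0xF)
        throw new ArgumentOutOfRangeException(nameof(uniquifier));
    if (timestamp < 0 || timestamp >= (1L << 56))
        throw new ArgumentOutOfRangeException(nameof(timestamp));

    return ((ulong)machine << 60) | ((ulong)timestamp << 4) | (ulong)uniquifier;
}

// Assumed timestamp unit: milliseconds since the Unix epoch, which fit in 56 bits.
long nowMs = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
ulong id = PackId(machine: 3, timestamp: nowMs, uniquifier: 0);
Console.WriteLine(id.ToString("x16"));  // 16 hex characters
```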



Answer 3:

If the GUID were simply a random number, you could keep an arbitrary subset of the bits and suffer a certain percent chance of collision that you can calculate with the "birthday algorithm":

double numBirthdays = 365;  // size of the ID space; set to e.g. 18446744073709551616d for 64 bits
double numPeople = 23;      // set to the maximum number of GUIDs you intend to store
double probability = 1;     // running probability that all birthdays are different
for (int x = 1; x < numPeople; x++)
    probability *= (numBirthdays - x) / numBirthdays;

Console.WriteLine("Probability that at least two people share a birthday:");
Console.WriteLine(1 - probability);
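
For very large ID spaces the loop above loses some precision, because subtracting a small x from 2^64 does not change a double at all. The usual closed-form approximation p ≈ 1 - e^(-n(n-1)/(2N)) is a quicker way to get a ballpark figure. A sketch, with a made-up function name:

```csharp
using System;

// Approximate probability that at least two of n random IDs collide
// when each ID is drawn uniformly from a space of 2^bits values.
static double CollisionProbability(double n, int bits)
{
    double space = Math.Pow(2, bits);
    double exponent = n * (n - 1) / (2 * space);
    return 1 - Math.Exp(-exponent);
}

Console.WriteLine(CollisionProbability(2_000_000, 64));  // roughly 1.08e-07
Console.WriteLine(CollisionProbability(2_000_000, 48));  // roughly 0.007
```

Like the loop, this assumes the retained bits are uniformly random, so it is a lower bound for real GUIDs.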

However, the probability of a collision is often higher because GUIDs are in general NOT random. According to Wikipedia's GUID article there are five types (versions) of GUIDs. The 13th digit specifies which kind of GUID you have, so it tends not to vary much, and the top two bits of the 17th digit are always fixed at 10 (so that digit is always 8, 9, a, or b).

For each type of GUID you'll get different degrees of randomness. Version 4 (13th digit = 4) is entirely random except for the 13th digit and the top two bits of the 17th; versions 3 and 5 are effectively random, as they are cryptographic hashes (MD5 and SHA-1 respectively); while versions 1 and 2 are mostly NOT random, although certain parts are fairly random in practical cases. A "gotcha" for version 1 and 2 GUIDs is that many GUIDs could come from the same machine, in which case they will have a large number of identical bits (in particular, the last 48 bits and many of the time bits will be identical). Or, if many GUIDs were created at the same time on different machines, their time bits could collide. So, good luck safely truncating that.
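
Before truncating anything, it may be worth checking which kind of GUID you actually have. The sketch below reads the two digits discussed above from the hyphen-free text form; the helper names are made up for illustration.

```csharp
using System;

// Returns the version, i.e. the 13th hex digit of the Guid.
static int GetVersion(Guid g)
{
    string hex = g.ToString("N");  // 32 hex digits, no hyphens
    return Convert.ToInt32(hex.Substring(12, 1), 16);
}

// Returns the top two bits of the 17th hex digit.
static int GetVariantBits(Guid g)
{
    string hex = g.ToString("N");
    return Convert.ToInt32(hex.Substring(16, 1), 16) >> 2;
}

var sample = Guid.Parse("0fe66778-c4a8-4f93-9bda-366224df6f11");
Console.WriteLine(GetVersion(sample));      // 4
Console.WriteLine(GetVariantBits(sample));  // 2 (binary 10)
```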

I had a situation where my software only supported 64 bits for unique IDs, so I couldn't use GUIDs directly. Luckily all of the GUIDs were version 4, so I could get 64 bits that were random or nearly random. I had two million records to store, and the birthday algorithm indicated that the probability of a collision was 1.08420141198273 × 10^-7 for 64 bits and 0.007 (0.7%) for 48 bits. This should be assumed to be the best-case scenario, since a decrease in randomness will usually increase the probability of collision.
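
One way to pick 64 bits from a version 4 GUID while avoiding the fixed version and variant bits is sketched below. The choice of bytes and the function name are assumptions for illustration, not necessarily what was done in the situation described above.

```csharp
using System;

// Hypothetical helper: builds a 64-bit value from bytes of a version 4 Guid
// that hold only random bits.
static ulong Take64RandomBits(Guid g)
{
    byte[] b = g.ToByteArray();
    // In ToByteArray() order, b[7] holds the version nibble and b[8] the
    // variant bits, so both bytes are skipped.
    int[] picks = { 0, 1, 2, 3, 4, 5, 9, 10 };
    ulong result = 0;
    for (int i = 0; i < picks.Length; i++)
        result |= (ulong)b[picks[i]] << (8 * i);
    return result;
}

var sample = Guid.Parse("0fe66778-c4a8-4f93-9bda-366224df6f11");
Console.WriteLine(Take64RandomBits(sample).ToString("x16"));  // 36dac4a80fe66778
```

This is only reasonable when every input is known to be version 4; for version 1 or 2 GUIDs the same bytes are mostly timestamp and would collide far more often.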

I suppose that in theory, more GUID types could exist in the future than are defined now, so a future-proof truncation algorithm is not possible.



Answer 4:

Truncating a GUID is a bad idea; please see this article for why.

You should consider a shorter representation of the GUID instead; a Google search turns up some solutions. These seem to involve taking a GUID and re-encoding it with a larger character set than hexadecimal (up to the full 8-bit character range), so that the same 128 bits take fewer characters.



Answer 5:

I agree with Rob: keep all of it.

But since you said the ID is going into a database, I thought I'd point out that using GUIDs doesn't necessarily mean they will index well. For that reason, the NHibernate developers created a Guid.Comb algorithm that's more DB friendly.

See NHibernate POID Generators revealed and documentation on the Guid Algorithms for more information.

NOTE: Guid.Comb is designed to improve performance on MS SQL Server.
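
The general COMB idea is to overwrite part of a random GUID with a timestamp so that values generated later sort later and land near the end of the index. The sketch below is not NHibernate's actual code. It is a simplified illustration that assumes SQL Server's uniqueidentifier ordering, which compares the last six bytes first, and it uses a made-up function name and a millisecond Unix timestamp.

```csharp
using System;

// Hypothetical helper: a random Guid whose last six bytes are replaced by a
// big-endian 48-bit millisecond timestamp.
static Guid NewComb(DateTimeOffset now)
{
    byte[] bytes = Guid.NewGuid().ToByteArray();
    long ms = now.ToUnixTimeMilliseconds();
    for (int i = 0; i < 6; i++)
        bytes[15 - i] = (byte)(ms >> (8 * i));  // bytes[10] ends up most significant
    return new Guid(bytes);
}

Console.WriteLine(NewComb(DateTimeOffset.UtcNow));
```

The trade-off is that 48 of the random bits are given up in exchange for insertion order, so this suits keys generated on servers you control rather than IDs that must be unpredictable.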