I have been experimenting with using UUIDs as database keys. I want to take up the least amount of bytes as possible, while still keeping the UUID representation human readable.
I think that I have gotten it down to 22 bytes using base64 and removing some trailing "==" that seem to be unnecessary to store for my purposes. Are there any flaws with this approach?
Basically my test code does a bunch of conversions to get the UUID down to a 22 byte String, then converts it back into a UUID.
import java.io.IOException;
import java.util.UUID;
public class UUIDTest {
public static void main(String[] args){
UUID uuid = UUID.randomUUID();
System.out.println("UUID String: " + uuid.toString());
System.out.println("Number of Bytes: " + uuid.toString().getBytes().length);
System.out.println();
byte[] uuidArr = asByteArray(uuid);
System.out.print("UUID Byte Array: ");
for(byte b: uuidArr){
System.out.print(b +" ");
}
System.out.println();
System.out.println("Number of Bytes: " + uuidArr.length);
System.out.println();
try {
// Convert a byte array to base64 string
String s = new sun.misc.BASE64Encoder().encode(uuidArr);
System.out.println("UUID Base64 String: " +s);
System.out.println("Number of Bytes: " + s.getBytes().length);
System.out.println();
String trimmed = s.split("=")[0];
System.out.println("UUID Base64 String Trimmed: " +trimmed);
System.out.println("Number of Bytes: " + trimmed.getBytes().length);
System.out.println();
// Convert base64 string to a byte array
byte[] backArr = new sun.misc.BASE64Decoder().decodeBuffer(trimmed);
System.out.print("Back to UUID Byte Array: ");
for(byte b: backArr){
System.out.print(b +" ");
}
System.out.println();
System.out.println("Number of Bytes: " + backArr.length);
byte[] fixedArr = new byte[16];
for(int i= 0; i<16; i++){
fixedArr[i] = backArr[i];
}
System.out.println();
System.out.print("Fixed UUID Byte Array: ");
for(byte b: fixedArr){
System.out.print(b +" ");
}
System.out.println();
System.out.println("Number of Bytes: " + fixedArr.length);
System.out.println();
UUID newUUID = toUUID(fixedArr);
System.out.println("UUID String: " + newUUID.toString());
System.out.println("Number of Bytes: " + newUUID.toString().getBytes().length);
System.out.println();
System.out.println("Equal to Start UUID? "+newUUID.equals(uuid));
if(!newUUID.equals(uuid)){
System.exit(0);
}
} catch (IOException e) {
}
}
public static byte[] asByteArray(UUID uuid) {
long msb = uuid.getMostSignificantBits();
long lsb = uuid.getLeastSignificantBits();
byte[] buffer = new byte[16];
for (int i = 0; i < 8; i++) {
buffer[i] = (byte) (msb >>> 8 * (7 - i));
}
for (int i = 8; i < 16; i++) {
buffer[i] = (byte) (lsb >>> 8 * (7 - i));
}
return buffer;
}
public static UUID toUUID(byte[] byteArray) {
long msb = 0;
long lsb = 0;
for (int i = 0; i < 8; i++)
msb = (msb << 8) | (byteArray[i] & 0xff);
for (int i = 8; i < 16; i++)
lsb = (lsb << 8) | (byteArray[i] & 0xff);
UUID result = new UUID(msb, lsb);
return result;
}
}
output:
UUID String: cdaed56d-8712-414d-b346-01905d0026fe
Number of Bytes: 36
UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2
Number of Bytes: 16
UUID Base64 String: za7VbYcSQU2zRgGQXQAm/g==
Number of Bytes: 24
UUID Base64 String Trimmed: za7VbYcSQU2zRgGQXQAm/g
Number of Bytes: 22
Back to UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2 0 38
Number of Bytes: 18
Fixed UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2
Number of Bytes: 16
UUID String: cdaed56d-8712-414d-b346-01905d0026fe
Number of Bytes: 36
Equal to Start UUID? true
I have an application where I'm doing almost exactly this. 22 char encoded UUID. It works fine. However, the main reason I'm doing it this way is that the IDs are exposed in the web app's URIs, and 36 characters is really quite big for something that appears in a URI. 22 characters is still kinda long, but we make do.
Here's the Ruby code for this:
It's not exactly the same as base64 encoding because base64 uses characters that would have to be escaped if they appeared in a URI path component. The Java implementation is likely to be quite different since you're more likely to have an array of raw bytes instead of a really big integer.
You don't say what DBMS you're using, but it seems that RAW would be the best approach if you're concerned about saving space. You just need to remember to convert for all queries, or you'll risk a huge performance drop.
But I have to ask: are bytes really that expensive where you live?
Below is what I use for a UUID (Comb style). It includes code for converting a uuid string or uuid type to base64. I do it per 64 bits, so I don't deal with any equal signs:
JAVA
Here's my code, it uses org.apache.commons.codec.binary.Base64 to produce url-safe unique strings that are 22 characters in length (and that have the same uniqueness as UUID).
I was also trying to do something similar. I am working with a Java application which uses UUIDs of the form
6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8
(which are generated with the standard UUID lib in Java). In my case I needed to be able to get this UUID down to 30 characters or less. I used Base64 and these are my convenience functions. Hopefully they will be helpful for someone as the solution was not obvious to me right away.Usage:
Output:
Functions:
You can safely drop the padding "==" in this application. If you were to decode the base-64 text back to bytes, most libraries would expect it to be there, but since you are just using the resulting string as a key, it's not a problem.
I like Base-64 because its limited character set looks less like gibberish, but there's also Base-85. It uses more characters and codes 4 bytes as 5 characters, so you could get your text down to 20 characters.