This is for the purpose of having a nice short URL which refers to an md5 hash in a database. I would like to convert something like this:
a7d2cd9e0e09bebb6a520af48205ced1
into something like this:
hW9lM5f27
Those both contain about the same amount of information. The method doesn't have to be direct and reversible but that would be nice (more flexible). At the least I would want a randomly generated string with the hex hash as the seed so it is reproducible. I'm sure there are many possible answers, I am curious to see how people would do it in an elegant way.
Oh, this doesn't have to have perfect 1:1 correspondence with the original hash but that would be a bonus (I guess I already implied that with the reversibility criteria). And I would like to avoid collisions if possible.
EDIT I realized my initial calculations were totally wrong (thanks to the people answering here but it took me awhile to clue in) and you can't really reduce the string length very much by throwing in all the lower case and uppercase letters into the mix. So I guess I will want something that doesn't directly convert from hex to base 62.
Here's a little function for consideration:
Basically you have 16-bytes of data in the MD5 hash string. It's 32 chars long because each byte is encoded as 2 hex digits (i.e. 00-FF). So we break them up into bytes and build up a 16-byte string of it. But because this is no longer human-readable or valid ASCII, we base-64 encode it back to readable chars. But since base-64 results in ~4/3 expansion (we only output 6 bits per 8 bits of input, thus requiring 32 bits to encode 24 bits), the 16-bytes becomes 22 bytes. But because base-64 encoding typically pads to lengths multiples of 4, we can take only the first 22 chars of the 24 character output (the last 2 of which are padding). Then we replace non-URL-safe characters used by base-64 encoding with URL-safe equivalents.
This is fully reversible, but that is left as an exercise to the reader.
I think this is the best you can do, unless you don't care about human-readable/ASCII, in which case you can just use $md5_bin_str directly.
And also you can use a prefix or other subset of the result from this function if you don't need to preserve all the bits. Throwing out data is obviously the simplest way to shorten things! (But then it's not reversible)
P.S. for your input of "a7d2cd9e0e09bebb6a520af48205ced1" (32 chars), this function will return "VUDNng4JvrtqUgr0QwXO0Q" (22 chars).
I would advise against a 1-1 correspondence:
With base-64 encoding you will only be able to decrease the input to (4/8)/(6/8) -> 4/6 ~ 66% in size (and this is assuming you deal with the "ugly" base64 characters without adding anything new).
I would likely consider a (secondary) look-up method to get truly "pretty" values. Once you have this alternate method established, choosing how to generate values in that range -- e.g. random numbers -- can be free of the source hash value (because correspondence is lost anyway) and an arbitrary "pretty" target-set can be used, perhaps [a-z][A-Z][0-9].
You can convert to the base (62 above) by simply following the divide-and-carry method and a look-up into an array. It should be fun little exercise.
Note: If you choose the random number from [0, 62^5) then you will get a value that will completely pack the encoded output (and fit within 32-bit integer values). You can then perform this process multiple times in a row to get a nice multiple of-5 result value, such as xxxxxyyyyyzzzzzz (where x,y,z are different groups and the total value is in the range (62^5)^3 -> 62^15 -> "a huge value")
Edit, for comment:
Because without the 1-1 correspondence you can make truly short pretty things -- perhaps as "small" as 8 characters long -- with base62, 8 characters can store up to 218340105584896 values, which is likely more than you will ever need. Or even 6 characters which "only" allows storage of 56800235584 different values! (And you still can't store that number in a plain 32-bit integer :-) If you drop to 5 characters, you once again reduce the space (to just under one billion: 916,132,832), but now you have something that can fit in a signed 32-bit integer (albeit it is somewhat wasteful).
The DB should ensure no duplicates, albeit an index on this value will be "fast-fragmenting" with a random source (but you could use counters or whatnot). A well-distributed PRNG should have minimal conflicts (read: retries) in a large enough range (assuming you keep the seed rolling and don't reset it, or reset it appropriately) -- Super 7 can even guarantee NO duplicates during a cycle (of only ~32k), but as you can see above, the target space is still big. See the math at the top of what maintaining a 1-1 relationship requires in terms of minimum encoded size.
The divide-and-carry method just explains how to get your source number into a different base -- perhaps base62. The same general method can be applied to go from the "natural" base (base10 in PHP) to any base.
Here are two conversion functions for Base-16 to Base-64 conversion and the inverse Base-64 to Base-16 for arbitrary input lengths:
If you need Base-64 encoding with the URL and filename safe alphabet , you can use these functions:
If you now want a function to compress your hexadecimal MD5 values using URL safe characters, you can use this:
And the inverse function:
It depends on what
a7d2cd9e0e09bebb6a520af48205ced1
is. Assuming you're talking about a hex number since it's coming frommd5
, you could just run abase64_encode
. If you have the hex in string form, you'd want to runhexdec
. Be careful you don't run into maxint issues though.Of course if I want a function to satisfy my needs perfectly I better make it myself. Here is what I came up with.
This is a very general purpose random string generator, however it is not just any old random string generator because the result is determined by the input string and any slight change to that input will produce a totally different result. You can do all sort of things with this:
Anyone see any problems with it or any room for improvement?
You could just do plain old base conversion. The hash is expressed in hexadecimal, and you can then create an alphabet of the size you want to express the hash. Base64 works well for this purpose, although you will probably want to write your own function so you end up encoding the value, not the string.
Note, however, that standard Base64 contains characters you wouldn't want to put in a URL; +, / and the padding character =. You can replace those characters with something else when converting back and forth to get a URL-safe Base64 encoding (or use a safe set of characters to begin with if you write your own function).