I have a bunch of 10 digit integers that I'm passing in a URL. Something like: "4294965286", "2292964213". They will always be positive and always be 10 digits.
I'd like to compress those integers into the smallest possible form that can still be used in a URL (i.e. letters and numbers are perfectly fine) and then uncompress them later. I've looked at using GZipStream, but it creates larger strings, not shorter.
I'm currently using asp.net so a vb.net or c# solution would be best.
Thanks
Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.
The "simple" method is to use base-64 encoding.
That is, convert the number (which is represented as base 10 in the string) to the actual series of bytes that represent the number (5 bytes will cover a 10 digit decimal number) and then base-64 encode that result. Each base-64 character stores 6 bits of information (compared with ~3.3 bits per decimal digit), so the output is just over half the original length (in this case, 6* base-64 output characters are required).
Additionally, since the input/output lengths are obtainable from the data itself, "123" might be originally (before being base-64 encoded) converted as 1 byte, "30000" as 2 bytes, etc. This would be advantageous if not all the numbers are approximately the same length.
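A minimal Python sketch of that variable-length idea. The function names, the big-endian byte order and the use of the URL-safe base-64 alphabet are assumptions made for illustration, not something specified above:

```python
import base64

def encode_min_bytes(n: int) -> str:
    """Base-64 encode n using only as many bytes as its value needs."""
    # Assumption: big-endian byte order, at least one byte even for 0.
    length = max(1, (n.bit_length() + 7) // 8)
    raw = n.to_bytes(length, "big")
    # The "=" filler carries no information, so it is stripped here.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

def decode_min_bytes(text: str) -> int:
    """Restore the "=" filler, decode, and read the bytes back as a number."""
    padded = text + "=" * (-len(text) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")
```

With this byte-granular approach "123" becomes 2 characters and "30000" becomes 3, while a value that needs all 5 bytes comes out as 7 characters rather than 6.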
Happy coding.
* Using base-64 requires 6 output characters.
Edit: I was wrong initially where I said "2.3 bits/char" for decimal and proposed that less than half the characters were required. I have updated the answer above and show the (should be correct) math here, where lg(n) is log to the base 2.
The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first, the number of bits required is lg(8999999999) -> ~33.06 bits. However, this transformation isn't able to increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).
So we try to find an x (number of characters in base-64 encoding) such that: lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53. Of course five and a half characters is nonsensical, so we choose 6 as the maximum number of characters required to encode a value up to 9999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.
However, it should be noted that to obtain only 6 characters in base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because out of the original 5 "required bytes" only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.
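The arithmetic can be double-checked with a few lines of Python. This is only a sketch; `chars_needed` is a hypothetical helper name, and it counts digits with exact integer arithmetic rather than rounding a logarithm:

```python
import math

def chars_needed(max_value: int, base: int) -> int:
    """Smallest number of base-`base` digits that can represent 0..max_value."""
    digits = 1
    capacity = base  # number of distinct values representable with `digits` digits
    while capacity <= max_value:
        capacity *= base
        digits += 1
    return digits

# Information content of the largest 10 digit number, in bits (about 33.2).
bits = math.log2(9_999_999_999)
```

`chars_needed(9_999_999_999, 64)` gives 6, and so does `chars_needed(8_999_999_999, 64)` for the shifted range, which matches the conclusion that the shift does not save a character here.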
Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.
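A minimal Python sketch of the same idea (this is not Guffa's C# code): the value is split directly into six 6-bit groups instead of going through whole bytes. The names `encode6` and `decode6` are hypothetical, and the standard base-64 alphabet is assumed, so "+" and "/" can appear in the output:

```python
# Standard base-64 alphabet; "+" and "/" are not URL-friendly.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode6(n: int) -> str:
    """Encode 0 <= n < 64**6 as exactly six base-64 characters."""
    if not 0 <= n < 64 ** 6:
        raise ValueError("value does not fit in six base-64 characters")
    # Take the six 6-bit groups from most significant to least significant.
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in range(30, -1, -6))

def decode6(text: str) -> int:
    """Inverse of encode6."""
    if len(text) != 6:
        raise ValueError("expected exactly six characters")
    n = 0
    for ch in text:
        n = (n << 6) | ALPHABET.index(ch)
    return n
```

Six characters hold 36 bits, so every 10 digit number (at most 34 bits) fits.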
Making it "prettier"
Since base-64 has been determined to use 6 characters, any encoding variant that still encodes the input bits into 6 characters will create just as small an output. Using a base-32 encoding won't quite make the cut, as in base-32 encoding 6 characters can only store 30 bits of information (lg(32) * 6).
However, the same output size could be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base 48-62 is that they only require a subset of alpha-numeric characters and do not need symbols; optionally, "ambiguous" symbols like "1" and "I" can be avoided for variants.) With a base-48 system the 6 characters can encode ~33.5 bits (lg(48) * 6) of information, which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.
Here is a proof-of-concept:
The result is:
The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.
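For the custom-alphabet variant described above, a minimal Python sketch could look like this. The 48-character alphabet is a hypothetical choice made for illustration (no "0", "1", "I", "O", "i", "l" or "o"), and `encode48`/`decode48` are made-up names:

```python
# Hypothetical 48-character alphabet: alphanumerics only, with some
# easily confused characters left out.
ALPHABET48 = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrs"
assert len(set(ALPHABET48)) == 48

def encode48(n: int, width: int = 6) -> str:
    """Encode 0 <= n < 48**width as a fixed-width base-48 string."""
    if not 0 <= n < 48 ** width:
        raise ValueError("value does not fit in the requested width")
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, 48)
        digits.append(ALPHABET48[remainder])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))

def decode48(text: str) -> int:
    """Inverse of encode48."""
    n = 0
    for ch in text:
        n = n * 48 + ALPHABET48.index(ch)
    return n
```

Since 48^6 = 12230590464 is larger than 9999999999, every 10 digit number fits in six characters.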
In addition to changing the base of the encoding (pst and I had the same thought around the same time), since all your numbers are 10 decimal digits, you can subtract the smallest 10 digit number (1000000000, i.e. 1E9) from each number before you encode it, and then add that back in after decoding. This will shift your encoded numbers into the range of 0 - 8999999999, which can allow for smaller strings after the base change.
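A small Python sketch of the shift, assuming every input really is a 10 digit number; `OFFSET`, `shift_down` and `shift_up` are hypothetical names:

```python
OFFSET = 10 ** 9  # smallest 10 digit number, 1000000000

def shift_down(n: int) -> int:
    """Map a 10 digit number into the range 0..8999999999 before encoding."""
    if not OFFSET <= n < 10 * OFFSET:
        raise ValueError("expected a 10 digit number")
    return n - OFFSET

def shift_up(n: int) -> int:
    """Undo shift_down after decoding."""
    return n + OFFSET
```

Both 9999999999 and 8999999999 need 34 bits, so whether the shift saves a character depends on the base chosen for the encoding.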
I think what you're looking for are Hash IDs: http://hashids.org/
They have implementations in many languages, although it looks like C# is not one of them.
I made an example for you in JavaScript: http://codepen.io/codycraven/pen/MbWwQm
Note that the HashIDs libraries protect your hashes from including foul language.
You could use base64 encoding to reduce the data into seven characters. You need five bytes to represent the number, and those can be encoded into eight characters using base64, but that last character is always the filler "=", so it can be removed:
Output:
To decode the text, you add the "=" again, decode it, and read it as a number:
Output:
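A minimal Python sketch of that round trip (five bytes, standard base64, filler stripped and restored). The function names and the big-endian byte order are assumptions made for illustration:

```python
import base64

def encode_number(n: int) -> str:
    """Encode n (up to 40 bits) as seven base64 characters."""
    raw = n.to_bytes(5, "big")  # assumption: big-endian, always five bytes
    # Five bytes always encode to seven characters plus one "=" filler.
    return base64.b64encode(raw).decode("ascii").rstrip("=")

def decode_number(text: str) -> int:
    """Add the "=" filler back, decode, and read the bytes as a number."""
    raw = base64.b64decode(text + "=")
    return int.from_bytes(raw, "big")
```

This version uses the standard alphabet, so the output may still contain characters that are not suitable for a URL.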
Two of the characters that base64 uses are not suitable for use in a URL, so you can replace them with other characters, and then replace them back. The "+" and "/" characters could for example be replaced by "-" and "_".
What about converting a big number to a formula: So instead of 21312312312 I might use 4^34. http://mathforum.org/library/drmath/view/65726.html