My ASP.NET page has following query string parameter:
…?IDs=1000000012,1000000021,1000000013,1000000022&...
Here IDs
parameter will always have numbers separated by something, in this case ,
. Currently there are 4 numbers but normally they would be in between 3
and 7
.
Now, I am looking for method to convert each big number from above into smallest possible value; specifically compressing value of IDs
query string parameter. Both, compressing each number algorithm or compressing whole value of IDs
query string parameter are welcome.
- Encode or decode is not an issue; just compressing the value
IDs
query string parameter. - Creating some unique small value for
IDs
and then retrieving its value from some data source is out of scope.
Is there an algorithm to compress such big numbers to small values or to compress value of the IDs
query string parameter all together?
Here's another really simple scheme that should give good compression for a set of numbers of the form
N + delta
where N is a large constant.This should reduce the set
{1000000012,1000000021,1000000013,1000000022}
to the list[1000000012,1,9,1]
, which you can then compress further by representing the numbers in base47 encoding as described in another answer.Using simple decimal encoding, this goes from 44 characters to 16 characters; i.e. 63%. (And using base47 will give even more compression).
If it is unacceptable to sort the ids, you don't get quite as good compression. For this example,
{1000000012,1000000021,1000000013,1000000022}
compresses to the list[1000000012,9,-8,9]
. That is just one character longer for this exampleEither way, this is better than a generic compression algorithm or encoding schemes ... FOR THIS KIND OF INPUT.
If the only issue is the URL length, you can convert numbers to base64 characters, then convert them back to numbers at the server side
You basically need so much room for your numbers because you are using base 10 to represent them. An improvement would be to use base 16 (hex). So for example, you could represent 255 (3 digits) as ff (2 digits).
You can take that concept further by using a much larger number base... the set of all characters that are valid query string parameters:
A-Z, a-z, 0-9, '.', '-', '~', '_', '+'
That gives you a base of 67 characters to work with (see Wikipedia on QueryString).
Have a look at this SO post for approaches to converting base 10 to arbitrary number bases.
EDIT:
In the linked SO post, look at this part:
That's almost what you need. Just expand it by adding the few characters it is missing:
yz.-~_+
That post is missing a method to go back to base 10. I'm not going to write it :-) but the procedure is like this:
Define a counter I'll call TOTAL.
Look at the right most character and find it's position in the array.
TOTAL = (the position of the character in the array) Example: Input is BA1. TOTAL is now 1 (since "1" is in position 1 in the array)
Now look at the next character left of the first one and find it's position in the array. TOTAL += 47 * (the position of the character in the array) Example: Input is BA1. TOTAL is now (47 * 11) + 1 = 518
Now look at the next character left of the previous one and find it's position in the array. TOTAL += 47 * 47 * (the position of the character in the array) Example: Input is BA1. Total is now (47 * 47 * 10) + (47 * 11) + 1 = 243508
And so on.
I suggest you write a unit test that converts a bunch of base 10 numbers into base 47 and then back again to make sure your conversion code works properly.
Note how you represented a 6 digit base 10 number in just 3 digits of base 47 :-)
how patterned are the IDs you are getting? if digit by digit, the IDs are random, then the method I am about to propose won't be very efficient. but if the IDs you gave as an example are representative of the types you'd be getting, then perhaps the following could work?
i motivate this idea by example.
you have for example, 1000000012 as ID that you'd like to compress. why not store it as [{1},{0,7},{12}]? This would mean that the first digit is a 1 followed by 7 zeros followed by a 12. Thus if we use the notation {x} that would represent one instance of x, while if we use {x,y} that would mean that x occurs y times in a row.
you could extend this with a little bit of pattern matching and/or function fitting.
for example, pattern matching: 1000100032 would be [{1000,2}{32}].
for example, function fitting: if your IDs are 10 digits, then split the ID into two 5 digit numbers and store the equation of the line that goes through both points. if ID = 1000000012, the you have y1 = 10000 and y2 = 12. therefore, your slope is -9988 and your intercept is 10000 (assuming x1 = 0, x2 = 1). In this case, it's not an improvement, but if the numbers were more random, it could be. Equivalently, you could store the sequence of IDs with piecewise linear functions.
in any case, this mostly depends on the structure of your IDs.
I assume you are doing this as a workaround for request URL length restrictions ...
Other answers have suggested encoding the decimal id numbers in hex, base47 or base64, but you can (in theory) do a lot better than that by using LZW (or similar) to compress the id list. Depending on how much redundancy there is in your ID lists, you could get significantly more than 40% reduction, even after re-encoding the compressed bytes as text.
In a nut-shell, I suggest that you find an off-the-shelf text compression library implemented in Javascript and use it client side to compress the ID list. Then encode the compressed bytestring using base47/base64, and pass the encoded string as the URL parameter. On the server side do the reverse; i.e. decode followed by decompress.
EDIT: As an experiment, I created a list of 36 different identifiers like the ones you supplied and compressed it using gzip. The original file is 396 bytes, the compressed file is 101 bytes, and the compressed + base64 file 138 bytes. That is a 65% reduction overall. And the compression ratio could actually improve for larger files. However, when I tried this with a small input set (e.g. just the 4 original identifiers), I got no compression, and after encoding the size was larger than the original.
Google "lzw library javascript"
In theory, there might be simpler solution. Send the parameters as "post data" rather than in the request URL, and get the browser to apply the compression using one of the encodings that it understands. That will give you more savings too since there is no need to encode the compressed data into legal URL characters.
The problem is getting the browser to compress the request ... and doing that in a browser independent way.
What is the range of your numbers? Assuming they can fit in a 16-bit integer, I would:
As an added bonus you don't need comma characters anymore because you know each number is 2 bytes.
Alternatively, if that isn't good enough, I'd use zlib to compress your stream of integers and then base64 the zlib-compressed stream. You can also switch to 32-bit integers if 16-bit isn't a large enough range (i.e. if you really need numbers in the 1,000,000,000 range).
Edit:
Maybe too late, but here's an implementation that might do what you need: