Compressing big number (or string) to small value

2019-01-24 13:53发布

My ASP.NET page has following query string parameter:

…?IDs=1000000012,1000000021,1000000013,1000000022&...

Here IDs parameter will always have numbers separated by something, in this case ,. Currently there are 4 numbers but normally they would be in between 3 and 7.

Now, I am looking for method to convert each big number from above into smallest possible value; specifically compressing value of IDs query string parameter. Both, compressing each number algorithm or compressing whole value of IDs query string parameter are welcome.

  1. Encode or decode is not an issue; just compressing the value IDs query string parameter.
  2. Creating some unique small value for IDs and then retrieving its value from some data source is out of scope.

Is there an algorithm to compress such big numbers to small values or to compress value of the IDs query string parameter all together?

6条回答
小情绪 Triste *
2楼-- · 2019-01-24 14:29

Here's another really simple scheme that should give good compression for a set of numbers of the form N + delta where N is a large constant.

public int[] compress(int[] input) {
    int[] res = input.clone();
    Arrays.sort(res);
    for (int i = 1; i < res.length; i++) {
        res[i] = res[i] - res[i - 1];
    }
    return res;
}

This should reduce the set {1000000012,1000000021,1000000013,1000000022} to the list [1000000012,1,9,1], which you can then compress further by representing the numbers in base47 encoding as described in another answer.

Using simple decimal encoding, this goes from 44 characters to 16 characters; i.e. 63%. (And using base47 will give even more compression).

If it is unacceptable to sort the ids, you don't get quite as good compression. For this example, {1000000012,1000000021,1000000013,1000000022} compresses to the list [1000000012,9,-8,9]. That is just one character longer for this example

Either way, this is better than a generic compression algorithm or encoding schemes ... FOR THIS KIND OF INPUT.

查看更多
虎瘦雄心在
3楼-- · 2019-01-24 14:32

If the only issue is the URL length, you can convert numbers to base64 characters, then convert them back to numbers at the server side

查看更多
闹够了就滚
4楼-- · 2019-01-24 14:38

You basically need so much room for your numbers because you are using base 10 to represent them. An improvement would be to use base 16 (hex). So for example, you could represent 255 (3 digits) as ff (2 digits).

You can take that concept further by using a much larger number base... the set of all characters that are valid query string parameters:

A-Z, a-z, 0-9, '.', '-', '~', '_', '+'

That gives you a base of 67 characters to work with (see Wikipedia on QueryString).

Have a look at this SO post for approaches to converting base 10 to arbitrary number bases.

EDIT:

In the linked SO post, look at this part:

string xx = IntToString(42, 
            new char[] { '0','1','2','3','4','5','6','7','8','9',
            'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
            'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x'});

That's almost what you need. Just expand it by adding the few characters it is missing:

yz.-~_+

That post is missing a method to go back to base 10. I'm not going to write it :-) but the procedure is like this:

Define a counter I'll call TOTAL.

Look at the right most character and find it's position in the array.
TOTAL = (the position of the character in the array) Example: Input is BA1. TOTAL is now 1 (since "1" is in position 1 in the array)

Now look at the next character left of the first one and find it's position in the array. TOTAL += 47 * (the position of the character in the array) Example: Input is BA1. TOTAL is now (47 * 11) + 1 = 518

Now look at the next character left of the previous one and find it's position in the array. TOTAL += 47 * 47 * (the position of the character in the array) Example: Input is BA1. Total is now (47 * 47 * 10) + (47 * 11) + 1 = 243508

And so on.

I suggest you write a unit test that converts a bunch of base 10 numbers into base 47 and then back again to make sure your conversion code works properly.

Note how you represented a 6 digit base 10 number in just 3 digits of base 47 :-)

查看更多
我命由我不由天
5楼-- · 2019-01-24 14:42

how patterned are the IDs you are getting? if digit by digit, the IDs are random, then the method I am about to propose won't be very efficient. but if the IDs you gave as an example are representative of the types you'd be getting, then perhaps the following could work?

i motivate this idea by example.

you have for example, 1000000012 as ID that you'd like to compress. why not store it as [{1},{0,7},{12}]? This would mean that the first digit is a 1 followed by 7 zeros followed by a 12. Thus if we use the notation {x} that would represent one instance of x, while if we use {x,y} that would mean that x occurs y times in a row.

you could extend this with a little bit of pattern matching and/or function fitting.

for example, pattern matching: 1000100032 would be [{1000,2}{32}].

for example, function fitting: if your IDs are 10 digits, then split the ID into two 5 digit numbers and store the equation of the line that goes through both points. if ID = 1000000012, the you have y1 = 10000 and y2 = 12. therefore, your slope is -9988 and your intercept is 10000 (assuming x1 = 0, x2 = 1). In this case, it's not an improvement, but if the numbers were more random, it could be. Equivalently, you could store the sequence of IDs with piecewise linear functions.

in any case, this mostly depends on the structure of your IDs.

查看更多
男人必须洒脱
6楼-- · 2019-01-24 14:46

I assume you are doing this as a workaround for request URL length restrictions ...

Other answers have suggested encoding the decimal id numbers in hex, base47 or base64, but you can (in theory) do a lot better than that by using LZW (or similar) to compress the id list. Depending on how much redundancy there is in your ID lists, you could get significantly more than 40% reduction, even after re-encoding the compressed bytes as text.

In a nut-shell, I suggest that you find an off-the-shelf text compression library implemented in Javascript and use it client side to compress the ID list. Then encode the compressed bytestring using base47/base64, and pass the encoded string as the URL parameter. On the server side do the reverse; i.e. decode followed by decompress.

EDIT: As an experiment, I created a list of 36 different identifiers like the ones you supplied and compressed it using gzip. The original file is 396 bytes, the compressed file is 101 bytes, and the compressed + base64 file 138 bytes. That is a 65% reduction overall. And the compression ratio could actually improve for larger files. However, when I tried this with a small input set (e.g. just the 4 original identifiers), I got no compression, and after encoding the size was larger than the original.

Google "lzw library javascript"

In theory, there might be simpler solution. Send the parameters as "post data" rather than in the request URL, and get the browser to apply the compression using one of the encodings that it understands. That will give you more savings too since there is no need to encode the compressed data into legal URL characters.

The problem is getting the browser to compress the request ... and doing that in a browser independent way.

查看更多
淡お忘
7楼-- · 2019-01-24 14:50

What is the range of your numbers? Assuming they can fit in a 16-bit integer, I would:

  • Store all your numbers as 16-bit integers (2 bytes per number, range -32,768 to 32,767)
  • Build a bytestream of 16-bit integers (XDR might be a good option here; at very least, make sure to handle endianness correctly)
  • Base64 encode the bytestream, using the modified base64 encoding for URLs (net is about 3 characters per number)

As an added bonus you don't need comma characters anymore because you know each number is 2 bytes.

Alternatively, if that isn't good enough, I'd use zlib to compress your stream of integers and then base64 the zlib-compressed stream. You can also switch to 32-bit integers if 16-bit isn't a large enough range (i.e. if you really need numbers in the 1,000,000,000 range).

Edit:

Maybe too late, but here's an implementation that might do what you need:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Scratch {
    class Program {
        static void Main(string[] args) {
            //var ids = new[] { 1000000012, 1000000021, 1000000013, 1000000022 };
            var rand = new Random();
            var ids = new int[rand.Next(20)];
            for(var i = 0; i < ids.Length; i++) {
                ids[i] = rand.Next();
            }

            WriteIds(ids);
            var s = IdsToString(ids);
            Console.WriteLine("\nResult string is: {0}", s);
            var newIds = StringToIds(s);
            WriteIds(newIds);
            Console.ReadLine();
        }

        public static void WriteIds(ICollection<Int32> ids) {
            Console.Write("\nIDs: ");
            bool comma = false;
            foreach(var id in ids) {
                if(comma) {
                    Console.Write(",");
                } else {
                    comma = true;
                }
                Console.Write(id);
            }
            Console.WriteLine();
        }

        public static string IdsToString(ICollection<Int32> ids) {
            var allbytes = new List<byte>();
            foreach(var id in ids) {
                var bytes = BitConverter.GetBytes(id);
                allbytes.AddRange(bytes);                
            }
            var str = Convert.ToBase64String(allbytes.ToArray(), Base64FormattingOptions.None);
            return str.Replace('+', '-').Replace('/', '_').Replace('=', '.');
        }

        public static ICollection<Int32> StringToIds(string idstring) {
            var result = new List<Int32>();
            var str = idstring.Replace('-', '+').Replace('_', '/').Replace('.', '=');
            var bytes = Convert.FromBase64String(str);
            for(var i = 0; i < bytes.Length; i += 4) {
                var id = BitConverter.ToInt32(bytes, i);
                result.Add(id);
            }
            return result;
        }
    }
}
查看更多
登录 后发表回答