Best compression algorithm for a sequence of integers

Posted 2019-01-08 05:31

I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm but that gives me only 50% compression. Note that the algorithm cannot be lossy.

All numbers are unique and progressively increasing.

Also, if you can point me to a Java implementation of such an algorithm, that would be great.

15 Answers
女痞
#2 · 2019-01-08 05:42

compress the string "1-100, 110-160" or store the string in some binary representation and parse it to restore the array
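A minimal sketch of that idea in Java (the exact text format and parser are my own assumptions; the answer doesn't specify one):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeCodec {
    // Encode a sorted array of unique positive ints as a compact range string,
    // e.g. {1..100, 110..160} -> "1-100,110-160".
    static String encode(int[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int start = data[i];
            // advance to the end of the consecutive run
            while (i + 1 < data.length && data[i + 1] == data[i] + 1) i++;
            if (sb.length() > 0) sb.append(',');
            sb.append(start);
            if (data[i] != start) sb.append('-').append(data[i]);
            i++;
        }
        return sb.toString();
    }

    // Parse the range string back into the original array.
    // Assumes positive integers (a leading '-' would break the split).
    static int[] decode(String s) {
        List<Integer> out = new ArrayList<>();
        for (String part : s.split(",")) {
            String[] ends = part.split("-");
            int lo = Integer.parseInt(ends[0]);
            int hi = ends.length > 1 ? Integer.parseInt(ends[1]) : lo;
            for (int v = lo; v <= hi; v++) out.add(v);
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For dense data like the example, the string is tiny compared to the raw array, and it stays human-readable.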

看我几分像从前
#3 · 2019-01-08 05:46

While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression ratios for a sequence of one million consecutive integers:

None        1.0
Deflate     0.50
Filtered    0.34
BZip2       0.11
Lzma        0.06
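For reference, a rough sketch of how the Deflate figure could be reproduced with Java's built-in `Deflater` (the answerer's exact harness isn't shown; the 4-byte big-endian layout and compression level are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class RatioTest {
    // Pack n consecutive ints as big-endian 4-byte values, deflate them,
    // and return compressed size / uncompressed size.
    public static double deflateRatio(int n) {
        ByteBuffer buf = ByteBuffer.allocate(n * 4);
        for (int i = 0; i < n; i++) buf.putInt(i);
        byte[] raw = buf.array();

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return (double) out.size() / raw.length;
    }
}
```

The exact ratio depends on the byte layout chosen, which is one reason the table above should be read as indicative rather than definitive.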
Deceive 欺骗
#4 · 2019-01-08 05:49

What size are the numbers? In addition to the other answers, you could consider base-128 variable-length encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - this is described here.

Combine this with the other techniques so you are storing "skip size", "take size", "skip size", "take size" - but noting that neither "skip" nor "take" will ever be zero, so we'll subtract one from each (which lets you save an extra byte for a handful of values).

So:

1-100, 110-160

is "skip 1" (assume start at zero as it makes things easier), "take 100", "skip 9", "take 51"; subtract 1 from each, giving (as decimals)

0,99,8,50

which encodes as (hex):

00 63 08 32

If we wanted to skip/take a larger number - 300, for example; we subtract 1 giving 299 - but that goes over 7 bits; starting with the little end, we encode blocks of 7 bits and an MSB to indicate continuation:

299 = 100101011 = (in blocks of 7): 0000010 0101011

so starting with the little end:

1 0101011 (leading one since continuation)
0 0000010 (leading zero as no more)

giving:

AB 02

So we can encode large numbers easily, but small numbers (which sound typical for skip/take) take less space.

You could try running this through "deflate", but it might not help much more...


If you don't want to deal with all that messy encoding cruft yourself... if you can create the integer array of the values (0,99,8,50) - you could use protocol buffers with a packed repeated uint32/uint64 - and it'll do all the work for you ;-p

I don't "do" Java, but here's a full C# implementation (borrowing some of the encoding bits from my protobuf-net project):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
static class Program
{
    static void Main()
    {
        var data = new List<int>();
        data.AddRange(Enumerable.Range(1, 100));
        data.AddRange(Enumerable.Range(110, 51));
        int[] arr = data.ToArray(), arr2;

        using (MemoryStream ms = new MemoryStream())
        {
            Encode(ms, arr);
            ShowRaw(ms.GetBuffer(), (int)ms.Length);
            ms.Position = 0; // rewind to read it...
            arr2 = Decode(ms);
        }
    }
    static void ShowRaw(byte[] buffer, int len)
    {
        for (int i = 0; i < len; i++)
        {
            Console.Write(buffer[i].ToString("X2"));
        }
        Console.WriteLine();
    }
    static int[] Decode(Stream stream)
    {
        var list = new List<int>();
        uint skip, take;
        int last = 0;
        while (TryDecodeUInt32(stream, out skip)
            && TryDecodeUInt32(stream, out take))
        {
            last += (int)skip+1;
            for(uint i = 0 ; i <= take ; i++) {
                list.Add(last++);
            }
        }
        return list.ToArray();
    }
    static int Encode(Stream stream, int[] data)
    {
        if (data.Length == 0) return 0;
        byte[] buffer = new byte[10];
        int last = -1, len = 0;
        for (int i = 0; i < data.Length; )
        {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            len += EncodeUInt32((uint)gap, buffer, stream)
                + EncodeUInt32((uint)size, buffer, stream);
        }
        return len;
    }
    public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
    {
        int count = 0, index = 0;
        do
        {
            buffer[index++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
            count++;
        } while (value != 0);
        buffer[index - 1] &= 0x7F;
        stream.Write(buffer, 0, count);
        return count;
    }
    public static bool TryDecodeUInt32(Stream source, out uint value)
    {
        int b = source.ReadByte();
        if (b < 0)
        {
            value = 0;
            return false;
        }

        if ((b & 0x80) == 0)
        {
            // single-byte
            value = (uint)b;
            return true;
        }

        int shift = 7;

        value = (uint)(b & 0x7F);
        bool keepGoing;
        int i = 0;
        do
        {
            b = source.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            i++;
            keepGoing = (b & 0x80) != 0;
            value |= ((uint)(b & 0x7F)) << shift;
            shift += 7;
        } while (keepGoing && i < 4);
        if (keepGoing && i == 4)
        {
            throw new OverflowException();
        }
        return true;
    }
}
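Since the question asks for Java, here is a sketch of the same skip/take + varint scheme ported over (class and method names are mine, not from any library):

```java
import java.io.*;
import java.util.*;

public class SkipTakeCodec {
    // Write value as a base-128 varint, little end first; a set MSB means "more bytes follow".
    static void writeUInt(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Returns -1 at end of stream; assumes decoded values fit in a non-negative int.
    static int readUInt(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return -1;
        int value = b & 0x7F, shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b < 0) throw new EOFException();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    // Encode a sorted array of unique ints as alternating (gap-1, runLength-1) varints.
    static byte[] encode(int[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = -1;
        for (int i = 0; i < data.length; ) {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            writeUInt(out, gap);
            writeUInt(out, size);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        List<Integer> list = new ArrayList<>();
        int last = 0, skip;
        while ((skip = readUInt(in)) >= 0) {
            int take = readUInt(in);
            last += skip + 1;
            for (int i = 0; i <= take; i++) list.add(last++);
        }
        return list.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

On the example data (1-100, 110-160) this produces the same four bytes as the C# version: 00 63 08 32.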
Anthone
#5 · 2019-01-08 05:50

I'd combine the answers given by CesarB and Fernando Miguélez.

First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.

Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.
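A minimal sketch of that two-step scheme in Java, assuming each run is stored as a (delta, count) pair (the answer doesn't fix a concrete layout):

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaRle {
    // Step 1: replace each value with its difference from the previous one
    // (the first value is diffed against 0); step 2: run-length encode the
    // resulting delta stream as (delta, count) pairs.
    static List<int[]> encode(int[] data) {
        List<int[]> runs = new ArrayList<>();
        int prev = 0, i = 0;
        while (i < data.length) {
            int delta = data[i] - prev;
            int count = 1;
            prev = data[i++];
            while (i < data.length && data[i] - prev == delta) {
                prev = data[i++];
                count++;
            }
            runs.add(new int[]{delta, count});
        }
        return runs;
    }

    static int[] decode(List<int[]> runs) {
        List<Integer> out = new ArrayList<>();
        int value = 0;
        for (int[] run : runs) {
            for (int c = 0; c < run[1]; c++) {
                value += run[0];
                out.add(value);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For the example data (1-100, 110-160), the entire array collapses to three runs: (1, 100), (10, 1), (1, 50).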

啃猪蹄的小仙女
#6 · 2019-01-08 05:50

Your case is very similar to the compression of indices in search engines. Popular compression algorithms used there are PForDelta and Simple16. You can use the kamikaze library for your compression needs.

来,给爷笑一个
#7 · 2019-01-08 05:53

I couldn't get my compression to be much better than about 0.11 for this. I generated my test data via the Python interpreter; it's a newline-delimited list of the integers 1-100 and 110-160. I use the actual program as a compressed representation of the data. My compressed file is as follows:

main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]

It's just a Haskell script that generates the file; you can run it via:

$ runhaskell generator.hs >> data

The size of the generator.hs file is 54 bytes, and the Python-generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or you could compress the compressed file (i.e. the Haskell file).

One other approach might be to store just 4 bytes of data: the min and max of each sequence, then feed those to a generating function. That said, loading files will add more characters to the decompressor, adding more complexity and more bytes. Again, I represented this very specific sequence with a program; it doesn't generalize, it's a compression specific to the data. Furthermore, adding generality makes the decompressor larger.

Another concern is that one must have the Haskell interpreter to run this. When I compiled the program it made it much larger; I don't really know why. There is the same problem with Python, so maybe the best approach is to give the ranges, so that some program could decompress the file.
