Best compression algorithm for a sequence of integers

Posted 2019-01-08 05:31

I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm but that gives me only 50% compression. Note that the algorithm cannot be lossy.

All numbers are unique and progressively increasing.

Also, if you can point me to a Java implementation of such an algorithm, that would be great.

15 Answers
女痞
#2 · 2019-01-08 05:42

compress the string "1-100, 110-160" or store the string in some binary representation and parse it to restore the array
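A minimal sketch of that idea in Java (the exact text format and parser are my own assumptions; the answer doesn't specify one):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeCodec {
    // Encode a sorted array of unique positive ints as a compact range string,
    // e.g. {1..100, 110..160} -> "1-100,110-160".
    static String encode(int[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int start = data[i];
            // advance to the end of the consecutive run
            while (i + 1 < data.length && data[i + 1] == data[i] + 1) i++;
            if (sb.length() > 0) sb.append(',');
            sb.append(start);
            if (data[i] != start) sb.append('-').append(data[i]);
            i++;
        }
        return sb.toString();
    }

    // Parse the range string back into the original array.
    // Assumes positive integers (a leading '-' would break the split).
    static int[] decode(String s) {
        List<Integer> out = new ArrayList<>();
        for (String part : s.split(",")) {
            String[] ends = part.split("-");
            int lo = Integer.parseInt(ends[0]);
            int hi = ends.length > 1 ? Integer.parseInt(ends[1]) : lo;
            for (int v = lo; v <= hi; v++) out.add(v);
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For dense data like the example, the string is tiny compared to the raw array, and it stays human-readable.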

看我几分像从前
#3 · 2019-01-08 05:46

While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression ratios for a sequence of one million consecutive integers:

None        1.0
Deflate     0.50
Filtered    0.34
BZip2       0.11
Lzma        0.06
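For reference, a rough sketch of how the Deflate figure could be reproduced with Java's built-in `Deflater` (the answerer's exact harness isn't shown; the 4-byte big-endian layout and compression level are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class RatioTest {
    // Pack n consecutive ints as big-endian 4-byte values, deflate them,
    // and return compressed size / uncompressed size.
    public static double deflateRatio(int n) {
        ByteBuffer buf = ByteBuffer.allocate(n * 4);
        for (int i = 0; i < n; i++) buf.putInt(i);
        byte[] raw = buf.array();

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return (double) out.size() / raw.length;
    }
}
```

The exact ratio depends on the byte layout chosen, which is one reason the table above should be read as indicative rather than definitive.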
Deceive 欺骗
#4 · 2019-01-08 05:49

What size are the numbers? In addition to the other answers, you could consider base-128 variable-length encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - this is described here.

Combine this with the other techniques so you are storing "skip size", "take size", "skip size", "take size" - but noting that neither "skip" nor "take" will ever be zero, so we'll subtract one from each (which lets you save an extra byte for a handful of values).

So:

1-100, 110-160

is "skip 1" (assume start at zero as it makes things easier), "take 100", "skip 9", "take 51"; subtract 1 from each, giving (as decimals)

0,99,8,50

which encodes as (hex):

00 63 08 32

If we wanted to skip/take a larger number - 300, for example; we subtract 1 giving 299 - but that goes over 7 bits; starting with the little end, we encode blocks of 7 bits and an MSB to indicate continuation:

299 = 100101011 = (in blocks of 7): 0000010 0101011

so starting with the little end:

1 0101011 (leading one since continuation)
0 0000010 (leading zero as no more)

giving:

AB 02

So we can encode large numbers easily, but small numbers (which sound typical for skip/take) take less space.

You could try running this through "deflate", but it might not help much more...


If you don't want to deal with all that messy encoding cruft yourself... if you can create the integer array of the values (0,99,8,50) - you could use protocol buffers with a packed repeated uint32/uint64 - and it'll do all the work for you ;-p

I don't "do" Java, but here's a full C# implementation (borrowing some of the encoding bits from my protobuf-net project):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
static class Program
{
    static void Main()
    {
        var data = new List<int>();
        data.AddRange(Enumerable.Range(1, 100));
        data.AddRange(Enumerable.Range(110, 51));
        int[] arr = data.ToArray(), arr2;

        using (MemoryStream ms = new MemoryStream())
        {
            Encode(ms, arr);
            ShowRaw(ms.GetBuffer(), (int)ms.Length);
            ms.Position = 0; // rewind to read it...
            arr2 = Decode(ms);
        }
    }
    static void ShowRaw(byte[] buffer, int len)
    {
        for (int i = 0; i < len; i++)
        {
            Console.Write(buffer[i].ToString("X2"));
        }
        Console.WriteLine();
    }
    static int[] Decode(Stream stream)
    {
        var list = new List<int>();
        uint skip, take;
        int last = 0;
        while (TryDecodeUInt32(stream, out skip)
            && TryDecodeUInt32(stream, out take))
        {
            last += (int)skip+1;
            for(uint i = 0 ; i <= take ; i++) {
                list.Add(last++);
            }
        }
        return list.ToArray();
    }
    static int Encode(Stream stream, int[] data)
    {
        if (data.Length == 0) return 0;
        byte[] buffer = new byte[10];
        int last = -1, len = 0;
        for (int i = 0; i < data.Length; )
        {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            len += EncodeUInt32((uint)gap, buffer, stream)
                + EncodeUInt32((uint)size, buffer, stream);
        }
        return len;
    }
    public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
    {
        int count = 0, index = 0;
        do
        {
            buffer[index++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
            count++;
        } while (value != 0);
        buffer[index - 1] &= 0x7F;
        stream.Write(buffer, 0, count);
        return count;
    }
    public static bool TryDecodeUInt32(Stream source, out uint value)
    {
        int b = source.ReadByte();
        if (b < 0)
        {
            value = 0;
            return false;
        }

        if ((b & 0x80) == 0)
        {
            // single-byte
            value = (uint)b;
            return true;
        }

        int shift = 7;

        value = (uint)(b & 0x7F);
        bool keepGoing;
        int i = 0;
        do
        {
            b = source.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            i++;
            keepGoing = (b & 0x80) != 0;
            value |= ((uint)(b & 0x7F)) << shift;
            shift += 7;
        } while (keepGoing && i < 4);
        if (keepGoing && i == 4)
        {
            throw new OverflowException();
        }
        return true;
    }
}
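Since the question asks for Java, here is a sketch of the same skip/take + varint scheme ported over (class and method names are mine, not from any library):

```java
import java.io.*;
import java.util.*;

public class SkipTakeCodec {
    // Write value as a base-128 varint, little end first; a set MSB means "more bytes follow".
    static void writeUInt(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Returns -1 at end of stream; assumes decoded values fit in a non-negative int.
    static int readUInt(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return -1;
        int value = b & 0x7F, shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b < 0) throw new EOFException();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    // Encode a sorted array of unique ints as alternating (gap-1, runLength-1) varints.
    static byte[] encode(int[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = -1;
        for (int i = 0; i < data.length; ) {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            writeUInt(out, gap);
            writeUInt(out, size);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        List<Integer> list = new ArrayList<>();
        int last = 0, skip;
        while ((skip = readUInt(in)) >= 0) {
            int take = readUInt(in);
            last += skip + 1;
            for (int i = 0; i <= take; i++) list.add(last++);
        }
        return list.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

On the example data (1-100, 110-160) this produces the same four bytes as the C# version: 00 63 08 32.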
Anthone
#5 · 2019-01-08 05:50

I'd combine the answers given by CesarB and Fernando Miguélez.

First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.

Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.
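A minimal sketch of that two-step scheme in Java, assuming each run is stored as a (delta, count) pair (the answer doesn't fix a concrete layout):

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaRle {
    // Step 1: replace each value with its difference from the previous one
    // (the first value is diffed against 0); step 2: run-length encode the
    // resulting delta stream as (delta, count) pairs.
    static List<int[]> encode(int[] data) {
        List<int[]> runs = new ArrayList<>();
        int prev = 0, i = 0;
        while (i < data.length) {
            int delta = data[i] - prev;
            int count = 1;
            prev = data[i++];
            while (i < data.length && data[i] - prev == delta) {
                prev = data[i++];
                count++;
            }
            runs.add(new int[]{delta, count});
        }
        return runs;
    }

    static int[] decode(List<int[]> runs) {
        List<Integer> out = new ArrayList<>();
        int value = 0;
        for (int[] run : runs) {
            for (int c = 0; c < run[1]; c++) {
                value += run[0];
                out.add(value);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For the example data (1-100, 110-160), the entire array collapses to three runs: (1, 100), (10, 1), (1, 50).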

啃猪蹄的小仙女
#6 · 2019-01-08 05:50

Your case is very similar to the compression of indices in search engines. Popular compression algorithms used there are PForDelta and Simple16. You can use the kamikaze library for your compression needs.

来,给爷笑一个
#7 · 2019-01-08 05:53

I couldn't get my compression to be much better than about 0.11 for this. I generated my test data via the Python interpreter; it's a newline-delimited list of the integers 1-100 and 110-160. I use the actual program as a compressed representation of the data. My compressed file is as follows:

main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]

It's just a Haskell script that generates the file; you can run it via:

$ runhaskell generator.hs >> data

The size of the generator.hs file is 54 bytes, and the Python-generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or you could compress the compressed file (i.e. the Haskell file).

One other approach might be to store just 4 bytes of data: the min and max of each sequence, then feed those to a generating function. That said, loading files will add more characters to the decompressor, adding more complexity and more bytes. Again, I represented this very specific sequence with a program; it doesn't generalize, it's a compression specific to the data. Furthermore, adding generality makes the decompressor larger.

Another concern is that one must have the Haskell interpreter to run this. When I compiled the program it made it much larger; I don't really know why. There is the same problem with Python, so maybe the best approach is to give the ranges, so that some program could decompress the file.
