Compress sorted integers

2019-01-25 09:37发布

I'm building a index which is just several sets of ordered 32 bit integers stored continuously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compressions scheme but that's a bit out of my expertise. So I'm wondering, what compression algorithm would work best in this case? Also, decompression has to be fast since this index will be used to make make look ups.

9条回答
别忘想泡老子
2楼-- · 2019-01-25 09:55

The conditions on the lists of integers is slightly different, but the question Compression for a unique stream of data suggests several approaches which could help you.

I'd suggest prefiltering the data into a start and a series of offsets. If you know that the offsets will reliably small you could even encode them as 1- or 2-byte quantities instead of 4-bytes. If you don't know this, each offset could still be 4 bytes, but since they will be small diffs, you'll get many more repeats than you would storing the original integers.

After prefiltering, run your output through the compression scheme of your choice - something that works on a byte level, like gzip or zlib, would probably do a really nice job.

查看更多
干净又极端
3楼-- · 2019-01-25 09:55

Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.

查看更多
Melony?
4楼-- · 2019-01-25 09:56

I would imagine Huffman coding would be quite appropiate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).

EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However if the list is not ordered or the spacing of numbers is very irregular, I think it would be no less effective to use plain Huffman encoding. In fact LZW or similar would likely be best in this case, though possibly still not very good.)

查看更多
ゆ 、 Hurt°
5楼-- · 2019-01-25 09:56

MSalters' answer is interesting but might distract you if you don't analyze properly. There are only 47 Fibonacci numbers that fit in 32-bits.

But he is spot on on how to properly solve the problem by analyzing the series of increments to find patterns there to compress.

Things that matter: a) Are there repeated values? If so, how often? (if important, make it part of the compression, if not make it an exception.) b) Does it look quasi-random? This also can be good as a suitable average increment can likely be found.

查看更多
乱世女痞
6楼-- · 2019-01-25 09:59

As you've discovered, a sorted sequence of N 32 bits integers doesn't have 32*N bits of data. This is no surprise. Assuming no duplicates, for every sorted sequence there are N! unsorted seqeuences containing the same integers.

Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on the use of shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers, and compressing those differences. They assume it will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.

However, take the Fibonacci sequence. That's definitely sorted integers. The difference between F(n) and F(n+1) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!

So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1] ? We know that P(N[x+1] < N[x]) = 0, as the sequence is sorted. The differential/Huffman-based solutions presented work because they assume P(N[x+1] - N[x] = d) is quite high for small positive d and independent from x, so they use can use a few bits for the small differences. If you can give another model, you can optimize for that.

查看更多
老娘就宠你
7楼-- · 2019-01-25 10:02

I'd use something bog standard off the shelf before investing in your own scheme.

In Java for example you can use GZIPOutputStream to apply gzip compression.

查看更多
登录 后发表回答