Best Compression algorithm for a sequence of integ

2019-01-08 05:31发布

I have a large array with a range of integers that are mostly continuous, eg 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm but that gives me only 50% compression. Note that the algorithm cannot be lossy.

All numbers are unique and progressively increasing.

Also if you can point me to the java implementation of such algorithm that would be great.

15条回答
Rolldiameter
2楼-- · 2019-01-08 05:56

First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). This should in your case give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.

This is how the PNG format does to improve its compression (it does one of several difference methods followed by the same compression algorithm used by gzip).

查看更多
男人必须洒脱
3楼-- · 2019-01-08 05:58

We have written recent research papers that survey the best schemes for this problem. Please see:

Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization,Software: Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137

Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience (to appear) http://arxiv.org/abs/1401.6399

They include an extensive experimental evaluation.

You can find a complete implementation of all techniques in C++11 online: https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection

There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte

If you prefer Java, please see https://github.com/lemire/JavaFastPFOR

查看更多
太酷不给撩
4楼-- · 2019-01-08 05:58

TurboPFor: Fastest Integer Compression

  • for C/C++ including Java Critical Natives/JNI Interface
  • SIMD accelerated integer compression
  • Scalar + Integrated (SIMD) differential/Zigzag encoding/decoding for sorted/unsorted integer lists
  • Full range 8/16/32/64 bits interger lists
  • Direct access
  • Benchmark app
查看更多
登录 后发表回答