I'm building an index which is just several sets of ordered 32-bit integers stored contiguously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compression scheme, but that's a bit outside my expertise. So I'm wondering: what compression algorithm would work best in this case? Also, decompression has to be fast, since this index will be used to perform lookups.
The conditions on the lists of integers are slightly different, but the question Compression for a unique stream of data suggests several approaches which could help you.
I'd suggest prefiltering the data into a start value and a series of offsets. If you know the offsets will be reliably small, you could even encode them as 1- or 2-byte quantities instead of 4 bytes. If you don't know this, each offset can still be 4 bytes, but since they will be small diffs, you'll get many more repeats than you would storing the original integers.

After prefiltering, run your output through the compression scheme of your choice - something that works at the byte level, like gzip or zlib, would probably do a really nice job.
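As a rough sketch of that prefilter-plus-gzip idea (the class name, method name, and byte layout here are just illustrative assumptions, not anything prescribed above), something like this in Java delta-encodes a sorted array and pipes the bytes through GZIPOutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: delta-encode a sorted int[] and gzip the result.
public class DeltaGzip {
    static byte[] compress(int[] sorted) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(raw)) {
            int prev = 0;
            for (int value : sorted) {
                int delta = value - prev;   // small for dense, sorted data
                prev = value;
                // Write the delta as 4 big-endian bytes; gzip exploits the
                // long runs of zero bytes this produces.
                gz.write((delta >>> 24) & 0xFF);
                gz.write((delta >>> 16) & 0xFF);
                gz.write((delta >>> 8) & 0xFF);
                gz.write(delta & 0xFF);
            }
        }
        return raw.toByteArray();
    }
}
```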
Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.
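If you go that route, you need a fallback for the occasional gap that doesn't fit in 16 bits. A minimal sketch, assuming an escape-marker scheme of my own choosing (not something from the suggestion above):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch: write 16-bit deltas, falling back to an escape
// marker plus a full 32-bit value when a gap doesn't fit.
public class ShortDeltas {
    static void write(int[] sorted, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            int prev = 0;
            for (int value : sorted) {
                long delta = (long) value - prev;
                prev = value;
                if (delta < 0xFFFF) {
                    out.writeShort((int) delta);  // common case: 2 bytes
                } else {
                    out.writeShort(0xFFFF);       // escape marker
                    out.writeInt(value);          // rare case: full 32-bit value
                }
            }
        }
    }
}
```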
I would imagine Huffman coding would be quite appropriate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).
EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However if the list is not ordered or the spacing of numbers is very irregular, I think it would be no less effective to use plain Huffman encoding. In fact LZW or similar would likely be best in this case, though possibly still not very good.)
MSalters' answer is interesting, but it might distract you if you don't analyze properly. There are only 47 Fibonacci numbers that fit in 32 bits.

But he is spot on about how to properly solve the problem: analyze the series of increments to find patterns there to compress.

Things that matter:
a) Are there repeated values? If so, how often? (If important, make it part of the compression; if not, treat it as an exception.)
b) Does it look quasi-random? This can also be good, as a suitable average increment can likely be found.
As you've discovered, a sorted sequence of N 32-bit integers doesn't have 32*N bits of data. This is no surprise. Assuming no duplicates, for every sorted sequence there are N! unsorted sequences containing the same integers.
Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on the use of shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers, and compressing those differences. They assume it will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.
However, take the Fibonacci sequence. That's definitely sorted integers. The difference between F(n) and F(n+1) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!
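To make that concrete (a standalone illustration, not part of the original answer), the deltas of a Fibonacci sequence are just the Fibonacci sequence again, shifted by one, so delta encoding buys nothing:

```java
// Each delta F(n+1) - F(n) equals F(n-1), i.e. the previous Fibonacci number.
public class FibDeltas {
    public static void main(String[] args) {
        int a = 1, b = 1;
        for (int i = 0; i < 10; i++) {
            int next = a + b;
            System.out.println("F=" + next + "  delta=" + (next - b) + "  (= previous F " + a + ")");
            a = b;
            b = next;
        }
    }
}
```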
So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1]? We know that P(N[x+1] < N[x]) = 0, as the sequence is sorted. The differential/Huffman-based solutions presented work because they assume P(N[x+1] - N[x] = d) is quite high for small positive d and independent of x, so they can use a few bits for the small differences. If you can give another model, you can optimize for that.
I'd use something bog standard off the shelf before investing in your own scheme.
In Java for example you can use GZIPOutputStream to apply gzip compression.
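A minimal sketch of what that looks like (the file path and the simple length-prefixed layout are just placeholders I've assumed):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: wrap the index file in GZIPOutputStream.
public class GzipIndexWriter {
    static void writeIndex(int[] sortedIds, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(path))))) {
            out.writeInt(sortedIds.length);  // simple header: element count
            for (int id : sortedIds) {
                out.writeInt(id);            // raw 32-bit values; gzip does the rest
            }
        }
    }
}
```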