TL;DR: Of the various compression algorithms available in Python (gzip, bz2, lzma, etc.), which has the best decompression performance?
Full discussion:
Python 3 has various modules for compressing/decompressing data, including gzip, bz2 and lzma. gzip and bz2 additionally have different compression levels you can set.
If my goal is to balance file size (/compression ratio) and decompression speed (compression speed is not a concern), which is going to be the best choice? Decompression speed is more important than file size, but as the uncompressed files in question would be around 600-800MB each (32-bit RGB .png image files), and I have a dozen of them, I do want some compression.
My use case is that I am loading a dozen images from disk, doing some processing on them (as a numpy array) and then using the processed array data in my program.
- The images never change, I just have to load them each time I run my program.
- The processing takes about the same length of time as the loading (several seconds), so I'm trying to save some loading time by saving the processed data (using pickle) rather than loading the raw, unprocessed images every time. Initial tests were promising: loading the raw/uncompressed pickled data took less than a second, vs 3 or 4 seconds to load and process the original image. But, as mentioned, this resulted in file sizes of around 600-800MB, while the original png images were only around 5MB. So I'm hoping I can strike a balance between loading time and file size by storing the pickled data in a compressed format.
UPDATE: The situation is actually a bit more complicated than I represented above. My application uses PySide2, so I have access to the Qt libraries.
- If I read the images and convert to a numpy array using pillow (PIL.Image), I actually don't have to do any processing, but the total time to read the image into the array is around 4 seconds.
- If instead I use QImage to read the image, I then have to do some processing on the result to make it usable for the rest of my program, due to the endianness of how QImage loads the data. Basically I have to swap the byte order and then rotate each "pixel" so that the alpha channel (which is apparently added by QImage) comes last rather than first. This whole process takes about 3.8 seconds, so marginally faster than just using PIL.
- If I save the numpy array uncompressed, then I can load it back in 0.8 seconds, which is by far the fastest, but with a large file size.
┌────────────┬────────────────────────┬───────────────┬─────────────┐
│ Python Ver │ Library/Method │ Read/unpack + │ Compression │
│ │ │ Decompress (s)│ Ratio │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.2 │ pillow (PIL.Image) │ 4.0 │ ~0.006 │
│ 3.7.2 │ Qt (QImage) │ 3.8 │ ~0.006 │
│ 3.7.2 │ numpy (uncompressed) │ 0.8 │ 1.0 │
│ 3.7.2 │ gzip (compresslevel=9) │ ? │ ? │
│ 3.7.2 │ gzip (compresslevel=?) │ ? │ ? │
│ 3.7.2 │ bz2 (compresslevel=9) │ ? │ ? │
│ 3.7.2 │ bz2 (compresslevel=?) │ ? │ ? │
│ 3.7.2 │ lzma │ ? │ ? │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.3 │ ? │ ? │ ? │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8beta1 │ ? │ ? │ ? │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8.0final │ ? │ ? │ ? │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.5.7 │ ? │ ? │ ? │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.6.10 │ ? │ ? │ ? │
└────────────┴────────────────────────┴───────────────┴─────────────┘
Sample .png image: As an example, take this 5.0MB png image, a fairly high resolution image of the coastline of Alaska.
Code for the png/PIL case (load into a numpy array):
from PIL import Image
import time
import numpy
start = time.time()
FILE = '/path/to/file/AlaskaCoast.png'
Image.MAX_IMAGE_PIXELS = None
img = Image.open(FILE)
arr = numpy.array(img)
print("Loaded in", time.time()-start)
This load takes around 4.2s on my machine with Python 3.7.2.
Alternatively, I can instead load the uncompressed pickle file generated by pickling the array created above.
Code for the uncompressed pickle load case:
import pickle
import time
start = time.time()
with open('/tmp/test_file.pickle','rb') as picklefile:
    arr = pickle.load(picklefile)
print("Loaded in", time.time()-start)
Loading from this uncompressed pickle file takes ~0.8s on my machine.
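The gzip, bz2 and lzma rows of the table above could be filled in with a small timing harness. The following is only a sketch under stated assumptions: the function name `benchmark` and the `CODECS` mapping are hypothetical, the choice of `compresslevel=1` as the second gzip setting is arbitrary, and the timing covers in-memory decompression plus unpickling only, not reading the file from disk. It takes the minimum over several runs to reduce noise.

```python
import bz2
import gzip
import lzma
import pickle
import timeit

# Candidate codecs: name -> (compress function, decompress function).
# The levels chosen here are illustrative assumptions, not recommendations.
CODECS = {
    "gzip (compresslevel=9)": (lambda b: gzip.compress(b, compresslevel=9), gzip.decompress),
    "gzip (compresslevel=1)": (lambda b: gzip.compress(b, compresslevel=1), gzip.decompress),
    "bz2 (compresslevel=9)": (lambda b: bz2.compress(b, compresslevel=9), bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(obj, repeats=5):
    """Return {codec name: (best decompress+unpickle seconds, compression ratio)}."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(raw)
        # Sanity check: the round trip must reproduce the pickled bytes.
        assert decompress(packed) == raw
        # Each run decompresses and unpickles once; keep the fastest run.
        best = min(timeit.repeat(lambda: pickle.loads(decompress(packed)),
                                 number=1, repeat=repeats))
        # Ratio is compressed size / uncompressed size, as in the table.
        results[name] = (best, len(packed) / len(raw))
    return results
```

Calling `benchmark(arr)` on the array loaded above would return one `(seconds, ratio)` pair per codec. For arrays of several hundred MB this holds multiple copies in memory at once, so it is meant for a one-off measurement rather than production use.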
The low-hanging fruit
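The snippet that originally followed this heading is not preserved here. A minimal sketch of what it might have looked like, assuming the array is stored with numpy.savez_compressed (the function this answer refers to); the helper names and the archive key `arr` are made up for illustration:

```python
import numpy as np


def save_compressed(path, arr):
    """Write arr into a DEFLATE-compressed .npz archive under the key 'arr'."""
    np.savez_compressed(path, arr=arr)


def load_compressed(path):
    """Read the array stored under the key 'arr' back from the .npz archive."""
    with np.load(path) as archive:
        return archive["arr"]
```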
Loading is 2.3x faster than your PIL-based code.
It uses zipfile.ZIP_DEFLATED, see the savez_compressed documentation.
Your PIL code also has an unneeded copy: array(img) should be asarray(img). It only costs 5% of the slow loading time, but after optimization this becomes significant, so keep in mind which numpy operations create a copy.
Fast decompression
According to the zstd benchmarks, when optimizing for decompression lz4 is a good choice. Just plugging this into pickle gives another 2.4x gain and is only 30% slower than uncompressed pickling.
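"Plugging this into pickle" can be done by writing the pickle stream through a compressing file object. The sketch below is an illustration, not the answer's original code: the helper names and the `opener` parameter are hypothetical. `opener` is any function that takes a path and a mode and returns a binary file-like object.

```python
import pickle


def dump_pickle(obj, path, opener):
    """Pickle obj into path through a compressing file opener."""
    with opener(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path, opener):
    """Unpickle an object from path through the matching opener."""
    with opener(path, "rb") as f:
        return pickle.load(f)
```

With the third-party lz4 package installed, passing `lz4.frame.open` as `opener` should give the pickle + lz4 variant discussed here. The standard library's `gzip.open`, `bz2.open` and `lzma.open` fit the same shape, which makes it easy to swap codecs when comparing.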
Benchmarks
The load time was measured inside Python (3.7.3), using the minimum wall-clock time over 20 runs on my desktop. According to occasional glances at top, it always seemed to be running on a single core.
For the curious: profiling
I'm not sure if the Python version matters; most work is supposed to happen inside of C libraries. To validate this I've profiled the pickle + lz4 variant:
Most time is spent inside of the Linux kernel, doing page_fault and stuff associated with (re-)allocating memory, probably including disk I/O. The high amount of memmove looks suspicious. Probably Python is re-allocating (resizing) the final array every time a new decompressed chunk arrives. If anyone likes to have a closer look: python and perf profiles.
Something I think should be fast is
i.e. write a program that generates source code like
the packed data ends up encoded directly into the .pyc file
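The generated source itself is not preserved here, so the following is a hedged sketch of the idea rather than the answer's code. It assumes lzma as the codec; the function names, the module attribute names `_PACKED` and `DATA`, and the default module name are all hypothetical. The generator writes a module containing the compressed bytes as a literal, and the loader imports that module from its path, at which point CPython normally caches the compiled result as a .pyc.

```python
import importlib.util
import lzma


def write_data_module(path, raw):
    """Generate a Python module whose DATA attribute is the decompressed bytes."""
    packed = lzma.compress(raw)
    with open(path, "w") as f:
        f.write("import lzma\n")
        # repr() of a bytes object is a valid ASCII-only bytes literal.
        f.write("_PACKED = " + repr(packed) + "\n")
        # Decompression happens once, when the module is imported.
        f.write("DATA = lzma.decompress(_PACKED)\n")


def load_data_module(path, name="embedded_data"):
    """Import the generated module from its file path and return DATA."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.DATA
```

The returned bytes could then be wrapped without a further copy using `numpy.frombuffer(data, dtype=numpy.uint8)` and reshaped to the image dimensions; note that an array created this way is read-only.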
For low-entropy data gzip decompression should be quite fast (edit: not really surprisingly, lzma is even faster, and it's still a predefined Python module).
With your "alaska" data this approach gives the following performance on my machine
You can even distribute just the .pyc provided you can control the python version used; the code to load a .pyc in Python 2 was a one liner but is now more convoluted (apparently it was decided that loading .pyc isn't supposed to be convenient).
Note that the compilation of the module is reasonably fast (e.g. the lzma version compiles on my machine in just 0.1 seconds), but it's a pity to waste 11MB more on disk for no real reason.
You can continue to use your existing PNGs and enjoy the space saving, but gain some speed by using libvips. Here is a comparison, but rather than test the speed of my laptop versus yours, I have shown 3 different methods so you can see the relative speed. I used:
Then I checked the performance in IPython because it has nice timing functions. As you can see, pyvips is 13 times faster than PIL, even with PIL 2x faster than the original version because of avoiding the array copy:
You can use Python-blosc
It is very fast and, for small arrays (<2GB), also quite easy to use. On easily compressible data like your example, it is often faster to compress the data for IO operations (SATA SSD: about 500 MB/s, PCIe SSD: up to 3500 MB/s). In the decompression step, the array allocation is the most costly part. If your images are of similar shape, you can avoid repeated memory allocation.
Example
A contiguous array is assumed for the following example.
Benchmarks
Timings