I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format and then read them back into memory relatively quickly. Unfortunately, cPickle is not fast enough.
I found numpy.savez and numpy.load. But the weird thing is that numpy.load loads an npy file as a "memory-map", which makes ordinary manipulation of the arrays really slow. For example, something like this would be really slow:
#!/usr/bin/python
import numpy as np
import time
from tempfile import TemporaryFile

n = 10000000
a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

outfile = TemporaryFile()
np.savez(outfile, a=a, b=b, c=c)
outfile.seek(0)

t = time.time()
z = np.load(outfile)
print("loading time =", time.time() - t)

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print("assigning time =", time.time() - t)
More precisely, the first line (the np.load call) is really fast, but the remaining lines, which assign the arrays to objects, are ridiculously slow:
loading time = 0.000220775604248
assigning time = 2.72940087318
Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.
I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:
http://www.pytables.org/
http://www.h5py.org/
Both are designed to work with numpy arrays efficiently.
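For example, a minimal sketch using h5py (the file name and dataset names are just placeholders):

import numpy as np
import h5py

a = np.arange(10000000)
b = np.arange(10000000) * 10

# write both arrays into a single HDF5 file
with h5py.File("arrays.h5", "w") as f:
    f.create_dataset("a", data=a)
    f.create_dataset("b", data=b)

# read them back; the [:] slice pulls each dataset into memory as a numpy array
with h5py.File("arrays.h5", "r") as f:
    aa = f["a"][:]
    bb = f["b"][:]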
I've compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it's useful anyway.
npy and raw binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which will save a lot of space at the cost of some load time.
If portability is an issue, raw binary is better than npy. If human readability is important, you'll have to sacrifice a lot of performance, but that can be achieved fairly well using csv (which is also very portable, of course).
More details and the code are available at the github repo.
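As a rough sketch of the options mentioned above (file names are arbitrary, and the actual trade-offs depend on your data):

import numpy as np

a = np.arange(10000000, dtype=np.float64)

np.save("a.npy", a)                    # npy: fast and compact for dense data
np.savez_compressed("a.npz", a=a)      # compressed npz: smaller for sparse/structured data, slower to load
a.tofile("a.bin")                      # raw binary: fast and portable, but dtype and shape are not stored
np.savetxt("a.csv", a, delimiter=",")  # csv: human-readable and very portable, but slow and large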
The lookup time is slow because when you use mmap, the content of the array is not loaded into memory when you invoke the load method. The data is lazily loaded when a particular piece of it is needed, and that is what happens during the lookups in your case (a second lookup won't be as slow). This is a nice feature of mmap: when you have a big array you do not have to load the whole thing into memory.
To solve your problem you can use joblib: you can dump any object you want using joblib.dump, even two or more numpy arrays; see the example below.
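A minimal sketch of that (assuming joblib is installed; the file name and keys are just placeholders):

import numpy as np
import joblib

a = np.arange(10000000)
b = np.arange(10000000) * 10

# dump both arrays into one file
joblib.dump({"a": a, "b": b}, "arrays.pkl")

# load them back eagerly (no lazy loading involved)
data = joblib.load("arrays.pkl")
aa = data["a"]
bb = data["b"]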
savez() saves the data in a zip file, and it may take some time to zip and unzip the file. You can use the save() and load() functions instead:
To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.
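For example (a minimal sketch, with an arbitrary file name):

import numpy as np

a = np.arange(10000000)
b = np.arange(10000000) * 10

# write both arrays to a single open file handle, one after the other
with open("arrays.npy", "wb") as f:
    np.save(f, a)
    np.save(f, b)

# read them back in the same order
with open("arrays.npy", "rb") as f:
    aa = np.load(f)
    bb = np.load(f)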
There is now an HDF5-based clone of pickle called hickle!
https://github.com/telegraphic/hickle
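Usage looks roughly like this (a sketch based on the hickle README; check the project page for the current API):

import numpy as np
import hickle as hkl

data = {"a": np.arange(10000000), "b": np.arange(10000000) * 10}

# dump the dictionary of arrays to an HDF5 file
hkl.dump(data, "data.hkl")

# load it back
data2 = hkl.load("data.hkl")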
EDIT:
There is also the possibility to "pickle" directly into a compressed archive by doing:
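A minimal sketch of that approach, using gzip from the standard library (the file name is arbitrary):

import gzip
import pickle
import numpy as np

a = np.random.rand(1000, 1000)

# pickle straight into a gzip-compressed file
with gzip.open("array.pkl.gz", "wb") as f:
    pickle.dump(a, f, protocol=pickle.HIGHEST_PROTOCOL)

# and read it back
with gzip.open("array.pkl.gz", "rb") as f:
    aa = pickle.load(f)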
Appendix
Another possibility to store numpy arrays efficiently is Bloscpack:
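A sketch of what that can look like (function names follow the bloscpack README and may differ between versions):

import time
import numpy as np
import bloscpack as bp

a = np.linspace(0, 100, 10000000)  # ~76 MB of float64 data

# pack the array to disk with Blosc compression and time it
t = time.time()
bp.pack_ndarray_to_file(a, "a.blp")
print("pack time =", time.time() - t)

# unpack it again
t = time.time()
b = bp.unpack_ndarray_from_file("a.blp")
print("unpack time =", time.time() - t)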
and the output for my laptop (a relatively old MacBook Air with a Core2 processor):
That means it can store data really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratio. Here are the sizes for these 76 MB arrays:
Please note that the use of the Blosc compressor is fundamental for achieving this. The same script with clevel = 0 (i.e. disabling compression):
is clearly bottlenecked by the disk performance.