I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.
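For reference, the current approach looks roughly like this (the shape is scaled down here, and the file name data.npy is just a placeholder):

```python
import numpy as np

# Scaled-down stand-in; the real arrays are ~(1e3, 1e6), i.e. about 8 GB at float64.
arr = np.random.rand(1000, 10000)

np.save("data.npy", arr)        # takes several seconds at the real size
loaded = np.load("data.npy")

assert np.array_equal(arr, loaded)   # the round trip is exact
```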
In my experience, np.save() and np.load() are the fastest solution for transferring data between hard disk and memory so far. I had relied heavily on databases and HDFS for data loading before reaching this conclusion. My tests show that the database data-loading bandwidth (from hard disk to memory) is around 50 MB/s, whereas the np.load() bandwidth is almost the same as my hard disk's maximum bandwidth: 2 GB/s. Both test environments use the simplest data structure.
And I don't think it's a problem to spend several seconds loading an array with shape (1e3, 1e6). E.g., if your array shape is (1000, 1000000) and its data type is float128, then the raw data size is (128 / 8) * 1000 * 1,000,000 = 16,000,000,000 bytes = 16 GB; if loading takes 4 seconds, your data-loading bandwidth is 16 GB / 4 s = 4 GB/s. SATA3's maximum bandwidth is 600 MB/s = 0.6 GB/s, so your data-loading bandwidth is already more than six times that, and it almost competes with DDR's maximum bandwidth. What else do you want?
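Spelled out in code, the arithmetic above (assuming float128 and a 4-second load, as in the example):

```python
bytes_per_element = 128 // 8                   # float128 -> 16 bytes per element
n_elements = 1000 * 1_000_000                  # shape (1e3, 1e6)
size_bytes = bytes_per_element * n_elements    # 16,000,000,000 bytes = 16 GB
bandwidth_gb_per_s = size_bytes / 1e9 / 4      # 16 GB loaded in 4 s -> 4 GB/s
print(size_bytes, bandwidth_gb_per_s)
```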
So my final conclusion is:
Don't use Python's pickle, don't use any database, and don't use any big-data system to store your data on the hard disk if you can use np.save() and np.load(). These two functions are the fastest solution so far for transferring data between hard disk and memory.
I've also tested HDF5 and found it much slower than np.load() and np.save(), so use np.save() and np.load() if you have enough DDR memory in your platform.
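A minimal way to check this on your own machine (the array size and file names are placeholders; pickle is included only as the comparison point mentioned above):

```python
import pickle
import time

import numpy as np

arr = np.random.rand(1000, 100_000)   # ~0.8 GB at float64; adjust to what fits in memory

# np.save / np.load
t0 = time.perf_counter()
np.save("arr.npy", arr)
t1 = time.perf_counter()
_ = np.load("arr.npy")
t2 = time.perf_counter()
print(f"np.save: {t1 - t0:.2f} s, np.load: {t2 - t1:.2f} s")

# pickle, for comparison
t0 = time.perf_counter()
with open("arr.pkl", "wb") as f:
    pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)
t1 = time.perf_counter()
with open("arr.pkl", "rb") as f:
    _ = pickle.load(f)
t2 = time.perf_counter()
print(f"pickle.dump: {t1 - t0:.2f} s, pickle.load: {t2 - t1:.2f} s")
```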
For really big arrays, I've heard about several solutions, and they mostly rely on being lazy on the I/O:

- NumPy.memmap, which behaves like an ndarray (any class accepting an ndarray accepts a memmap)
- Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py
- Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)
Numpy.memmap

numpy.memmap keeps the array data in a binary file on disk and maps it into memory lazily, so only the parts you actually access are read or written (see the numpy.memmap docs for details).
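A minimal memmap sketch (the file name big.dat is arbitrary; float64 is assumed here, matching dtype=float from the question):

```python
import numpy as np

shape = (1000, 1_000_000)   # ~8 GB at float64, but never fully resident in RAM

# Create a disk-backed array; note this allocates the full file on disk.
fp = np.memmap("big.dat", dtype=np.float64, mode="w+", shape=shape)
fp[0, :10] = np.arange(10)  # writes are paged out to the file
fp.flush()

# Re-open later without loading the whole file into memory.
ro = np.memmap("big.dat", dtype=np.float64, mode="r", shape=shape)
print(ro[0, :10])
```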
HDF5 arrays

HDF5 is a bigdata-ready, chunked file format for numerical data; h5py exposes it with a NumPy-like interface (see the h5py docs for details).
The format supports compressing the data in various ways (more bits loaded for the same I/O read). This makes the data less easy to query individually, but in your case (purely loading/dumping arrays) it might be efficient.
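A minimal h5py round trip (the file name data.h5 and dataset name arr are arbitrary; compression is optional and trades CPU time for I/O volume):

```python
import h5py
import numpy as np

arr = np.random.rand(1000, 10_000)

# Write: each dataset lives under a name inside the HDF5 file.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("arr", data=arr)   # pass compression="gzip" to enable compression

# Read: slicing pulls data from disk; [:] loads the whole dataset.
with h5py.File("data.h5", "r") as f:
    loaded = f["arr"][:]

assert np.array_equal(arr, loaded)
```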
Here is a comparison with PyTables.
I cannot get up to (int(1e3), int(1e6)) due to memory restrictions. Therefore, I used a smaller array:

NumPy save:

NumPy load:

PyTables writing:

PyTables reading:
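This kind of comparison can be set up as follows (a sketch, not the exact benchmark code; the array here is smaller than in the question and the file names are arbitrary). Wrap each block in %timeit, or time it with time.perf_counter, to reproduce the measurements:

```python
import numpy as np
import tables

data = np.random.rand(1000, 100_000)   # smaller than (1e3, 1e6); ~0.8 GB at float64

# NumPy save / load
np.save("data.npy", data)
_ = np.load("data.npy")

# PyTables: plain (uncompressed) array inside an HDF5 file
with tables.open_file("data_tables.h5", mode="w") as f:
    f.create_array(f.root, "data", data)

with tables.open_file("data_tables.h5", mode="r") as f:
    _ = f.root.data.read()
```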
The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum writing and reading rates of my SSD.
Writing:
Reading:
Compression does not really help due to the randomness of the data:
Reading of the compressed data becomes a bit slower:
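Enabling compression in PyTables goes through a Filters object on a chunked array; the complevel and complib below are just example choices:

```python
import numpy as np
import tables

data = np.random.rand(1000, 100_000)   # random data, so compression gains are minimal

filters = tables.Filters(complevel=5, complib="blosc")   # "zlib" or "lzo" also work

with tables.open_file("compressed.h5", mode="w") as f:
    f.create_carray(f.root, "data", obj=data, filters=filters)

with tables.open_file("compressed.h5", mode="r") as f:
    _ = f.root.data.read()
```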
This is different for regular data:
Writing is significantly faster:
1 loops, best of 3: 849 ms per loop
The same holds true for reading:
Conclusion: the more regular your data is, the faster it should get using PyTables.
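One way to see this effect is to compare the compressed file sizes (and timings) of a highly regular array against a random one; the shapes and file names below are placeholders:

```python
import os

import numpy as np
import tables

regular = np.tile(np.arange(10_000, dtype=np.float64), (1000, 10))  # repetitive, compresses well
random_ = np.random.rand(1000, 100_000)                             # barely compresses

filters = tables.Filters(complevel=5, complib="blosc")
for name, data in [("regular", regular), ("random", random_)]:
    with tables.open_file(f"{name}.h5", mode="w") as f:
        f.create_carray(f.root, "data", obj=data, filters=filters)
    print(name, round(os.path.getsize(f"{name}.h5") / 1e6), "MB")
```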