Optimising HDF5 dataset for Read/Write speed

Posted 2020-08-04 06:25

Question:

I'm currently running an experiment where I scan a target spatially and grab an oscilloscope trace at each discrete pixel. Generally my trace lengths are 200k points. After scanning the entire target I assemble these time-domain signals spatially and essentially play back a movie of what was scanned. My scan area is 330x220 pixels in size, so the entire dataset is larger than the RAM of the computer I have to use.
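
For scale, a quick back-of-the-envelope estimate of the raw data volume, assuming the traces are stored as 32-bit floats as in the snippets below:

    # rough size of the full scan in single precision (4 bytes per sample)
    height, width, dataLen = 330, 220, 200000
    size_bytes = height * width * dataLen * 4
    print(size_bytes / 1e9)  # ~58 GB, far more than typical RAM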

To start with I was just saving each oscilloscope trace as a numpy array and then, after my scan completed, downsampling/filtering etc. and piecing the movie together in a way that didn't run into memory problems. However, I'm now at a point where I can't downsample, as aliasing will occur, and thus need to access the raw data.

I've started looking into storing my large 3D data block in an HDF5 dataset using h5py. My main issue is with my chunk size allocation. My incoming data is orthogonal to the plane that I'd like to read it out in. My main options (to my knowledge) for writing my data are:

    # Fast write, slow read: one chunk per pixel trace
    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000

    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(1, 1, dataLen), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(dataLen)

or

    # Slow write, fast read: one chunk per image plane
    # (same imports and dimensions as above)
    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(height, width, 1), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(dataLen)

Is there some way I can optimize the two cases so that neither is horribly inefficient to run?

Answer 1:

You have some performance pitfalls in your code.

  1. You are using fancy indexing in the line dset[i,j,:] = np.random.random(200000) (don't change the number of array dimensions when reading from or writing to an HDF5 dataset).
  2. Set up a proper chunk-cache size if you are not reading or writing whole chunks (see the sketch after this list): https://stackoverflow.com/a/42966070/4045774

  3. Reduce the number of read and write calls to the HDF5 API.

  4. Choose an appropriate chunk size (chunks can only be read or written in their entirety, so if you only need part of a chunk the rest should stay in the cache).
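
As an aside, newer h5py releases (2.9 and later) also expose the chunk cache directly through the rdcc_nbytes (and related rdcc_nslots/rdcc_w0) arguments of h5py.File, so the same idea works without an extra package. A minimal sketch, assuming such a version is available:

    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000

    # ~500 MB chunk cache instead of the 1 MB default, so partially
    # filled chunks stay in memory between per-pixel writes
    with h5py.File("test_h5py.hdf5", "a", rdcc_nbytes=500*1024**2) as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(20, 20, 20), dtype='f')
        dset[0:1, 0:1, :] = np.random.random((1, 1, dataLen))  # writes now go through the larger cache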

The following example uses the chunk cache provided by the HDF5 API. To set up a proper cache size I will use h5py_cache: https://pypi.python.org/pypi/h5py-cache/1.0.1

You could further improve the performance if you do the caching yourself (read and write whole chunks); see the sketch after the reading example below.

Writing

    import h5py_cache
    import numpy as np

    h5pyfile = "test_h5py.hdf5"
    height, width, dataLen = 330, 220, 200000

    # minimal cache size for reasonable performance would be 20*20*dataLen*4 = 320 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'a', chunk_cache_mem_size=500*1024**2) as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(20, 20, 20), dtype='f')
        for i in range(height):
            for j in range(width):
                # avoid fancy slicing: keep the number of dimensions on both sides
                dset[i:i+1, j:j+1, :] = np.expand_dims(np.expand_dims(np.random.random(dataLen), axis=0), axis=0)

Reading

    # same imports, file name and dimensions as in the writing example above
    # minimal cache size for reasonable performance would be height*width*500*4 = 145 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=200*1024**2) as f:
        dset = f["uncompchunk"]
        for i in range(0, dataLen):
            Image = np.squeeze(dset[:, :, i:i+1])
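
To illustrate the "do the caching yourself" idea from above, here is a minimal sketch (the dataset name and the 20-pixel block size are only illustrative) that buffers a whole spatial block of traces in NumPy and writes it in a single call, so every chunk it covers is written exactly once and no large chunk cache is needed:

    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000
    block = 20  # matches the chunk size in the two spatial dimensions

    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("manualchunk", (height, width, dataLen), chunks=(block, block, 20), dtype='f')
        for i in range(0, height, block):
            for j in range(0, width, block):
                i1, j1 = min(i + block, height), min(j + block, width)
                # gather a block of traces in RAM (~320 MB as float32) ...
                buf = np.random.random((i1 - i, j1 - j, dataLen)).astype('f')
                # ... and write it in one call, touching each chunk only once
                dset[i:i1, j:j1, :] = buf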


Answer 2:

If you want to optimise your I/O performance with chunking, you should read these two articles from Unidata:

chunking general

optimising for access pattern

And if you are only going for raw I/O performance, consider @titusjan's advice.