Optimising HDF5 dataset for Read/Write speed

Posted 2020-08-04 06:25

Question:

I'm currently running an experiment where I scan a target spatially and grab an oscilloscope trace at each discrete pixel. Generally my trace lengths are 200k points. After scanning the entire target I assemble these time-domain signals spatially and essentially play back a movie of what was scanned. My scan area is 330x220 pixels in size, so the entire dataset is larger than the RAM of the computer I have to use.
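
For scale, a quick back-of-the-envelope estimate of the raw data volume, assuming the traces are stored as 32-bit floats as in the snippets below:

    # rough size of the full scan in single precision (4 bytes per sample)
    height, width, dataLen = 330, 220, 200000
    size_bytes = height * width * dataLen * 4
    print(size_bytes / 1e9)  # ~58 GB, far more than typical RAM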

To start with I was just saving each oscilloscope trace as a numpy array and then, after my scan completed, downsampling/filtering etc. and piecing the movie together in a way that didn't run into memory problems. However, I'm now at a point where I can't downsample, as aliasing will occur, and thus need to access the raw data.

I've started looking into storing my large 3D data block in an HDF5 dataset using h5py. My main issue is with my chunk size allocation. My incoming data is orthogonal to the plane that I'd like to read it out in. My main options (to my knowledge) for writing my data are:

    # Fast write, slow read: one chunk per pixel trace
    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000

    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(1, 1, dataLen), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(dataLen)

or

    # Slow write, fast read: one chunk per image plane
    # (same imports and dimensions as above)
    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(height, width, 1), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(dataLen)

Is there some way I can optimize the two cases so that neither is horribly inefficient to run?

Answer 1:

You have some performance pitfalls in your code.

  1. You are using fancy indexing in the line dset[i,j,:] = np.random.random(200000) (don't change the number of array dimensions when reading from or writing to an HDF5 dataset).
  2. Set up a proper chunk-cache size if you are not reading or writing whole chunks (see the sketch after this list): https://stackoverflow.com/a/42966070/4045774

  3. Reduce the number of read and write calls to the HDF5 API.

  4. Choose an appropriate chunk size (chunks can only be read or written in their entirety, so if you only need part of a chunk the rest should stay in the cache).
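
As an aside, newer h5py releases (2.9 and later) also expose the chunk cache directly through the rdcc_nbytes (and related rdcc_nslots/rdcc_w0) arguments of h5py.File, so the same idea works without an extra package. A minimal sketch, assuming such a version is available:

    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000

    # ~500 MB chunk cache instead of the 1 MB default, so partially
    # filled chunks stay in memory between per-pixel writes
    with h5py.File("test_h5py.hdf5", "a", rdcc_nbytes=500*1024**2) as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(20, 20, 20), dtype='f')
        dset[0:1, 0:1, :] = np.random.random((1, 1, dataLen))  # writes now go through the larger cache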

The following example uses the chunk cache provided by the HDF5 API. To set up a proper cache size I will use h5py_cache: https://pypi.python.org/pypi/h5py-cache/1.0.1

You could further improve the performance if you do the caching yourself (read and write whole chunks); see the sketch after the reading example below.

Writing

    import h5py_cache
    import numpy as np

    h5pyfile = "test_h5py.hdf5"
    height, width, dataLen = 330, 220, 200000

    # minimal cache size for reasonable performance would be 20*20*dataLen*4 = 320 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'a', chunk_cache_mem_size=500*1024**2) as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(20, 20, 20), dtype='f')
        for i in range(height):
            for j in range(width):
                # avoid fancy slicing: keep the number of dimensions on both sides
                dset[i:i+1, j:j+1, :] = np.expand_dims(np.expand_dims(np.random.random(dataLen), axis=0), axis=0)

Reading

    # same imports, file name and dimensions as in the writing example above
    # minimal cache size for reasonable performance would be height*width*500*4 = 145 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=200*1024**2) as f:
        dset = f["uncompchunk"]
        for i in range(0, dataLen):
            Image = np.squeeze(dset[:, :, i:i+1])
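
To illustrate the "do the caching yourself" idea from above, here is a minimal sketch (the dataset name and the 20-pixel block size are only illustrative) that buffers a whole spatial block of traces in NumPy and writes it in a single call, so every chunk it covers is written exactly once and no large chunk cache is needed:

    import h5py
    import numpy as np

    height, width, dataLen = 330, 220, 200000
    block = 20  # matches the chunk size in the two spatial dimensions

    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("manualchunk", (height, width, dataLen), chunks=(block, block, 20), dtype='f')
        for i in range(0, height, block):
            for j in range(0, width, block):
                i1, j1 = min(i + block, height), min(j + block, width)
                # gather a block of traces in RAM (~320 MB as float32) ...
                buf = np.random.random((i1 - i, j1 - j, dataLen)).astype('f')
                # ... and write it in one call, touching each chunk only once
                dset[i:i1, j:j1, :] = buf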


Answer 2:

If you want to optimise your I/O performance with chunking, you should read these two articles from Unidata:

chunking general

optimising for access pattern

And if you are only going for raw I/O performance, consider @titusjan's advice.