Out-of-core storage of 4D image data from TIFF files as HDF5 in Python

Posted 2020-03-25 06:26


I have 27GB of 2D TIFF files that represent slices of a movie of 3D images. I want to be able to slice this data as if it were a simple 4D NumPy array. It looks like dask.array is a good tool for cleanly manipulating the array once it's stored on disk as an HDF5 file.

How can I store these files as an HDF5 file in the first place if they do not all fit into memory? I am new to h5py and databases in general.



Edit: Use dask.array's imread function

As of dask 0.7.0 you don't need to store your images in HDF5. Use the imread function directly instead:

In [1]: from skimage.io import imread

In [2]: im = imread('foo.1.tiff')

In [3]: im.shape
Out[3]: (5, 5, 3)

In [4]: ls foo.*.tiff
foo.1.tiff  foo.2.tiff  foo.3.tiff  foo.4.tiff

In [5]: from dask.array.image import imread

In [6]: im = imread('foo.*.tiff')

In [7]: im.shape
Out[7]: (4, 5, 5, 3)
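
For the 4D case in the question, the stacked array can be reshaped lazily into (time, z, y, x). The following is a minimal sketch, not a definitive recipe. It makes several assumptions: the file names, the 2 x 3 layout and the image size are made up, np.load on .npy files stands in for a TIFF reader so the example runs without image files, and the files are assumed to sort lexicographically into time-major order (which is why the names are zero-padded; 'foo.10.tiff' would otherwise sort before 'foo.2.tiff').

```python
import os
import tempfile

import numpy as np
from dask.array.image import imread

# Hypothetical layout: 2 time points x 3 z-slices, each slice a 4x5 image.
n_t, n_z, height, width = 2, 3, 4, 5
folder = tempfile.mkdtemp()
for t in range(n_t):
    for z in range(n_z):
        # Zero-padded names keep lexicographic order equal to (t, z) order.
        name = os.path.join(folder, 'slice.t%03d.z%03d.npy' % (t, z))
        np.save(name, np.full((height, width), 10 * t + z, dtype='uint8'))

# np.load is a stand-in reader (assumption); by default imread uses
# skimage.io.imread, which is what you would want for real TIFF files.
stack = imread(os.path.join(folder, 'slice.*.npy'), imread=np.load)

# Lazy reshape from (n_files, y, x) to (t, z, y, x); nothing is read yet.
movie = stack.reshape((n_t, n_z, height, width))

# Indexing stays lazy until compute() triggers the actual file reads.
corner = movie[1, 2, 0, 0].compute()
```

With real data you would drop the imread= argument and point the glob pattern at your TIFF files.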

Older answer that stores images into HDF5

Data ingest is often the trickiest part of the problem. When this answer was written, dask.array had no automatic integration with image files (the imread function shown in the edit above came later). Fortunately, moving data into h5py is easy because h5py supports NumPy slicing syntax. In the following example we'll create an empty h5py Dataset and then store four tiny TIFF files into that dataset in a for loop.

First we get filenames for our images (please forgive the toy dataset; I don't have anything realistic lying around):

In [1]: from glob import glob
In [2]: filenames = sorted(glob('foo.*.tiff'))
In [3]: filenames
Out[3]: ['foo.1.tiff', 'foo.2.tiff', 'foo.3.tiff', 'foo.4.tiff']

Load in and inspect a sample image:

In [4]: from skimage.io import imread
In [5]: im = imread(filenames[0])  # a sample image
In [6]: im.shape  # tiny image
Out[6]: (5, 5, 3)
In [7]: im.dtype
Out[7]: dtype('int8')

Now we'll make an HDF5 file and an HDF5 dataset called '/x' within that file.

In [8]: import h5py
In [9]: f = h5py.File('myfile.hdf5', 'a')  # 'a' creates the file if missing; h5py >= 3.0 defaults to read-only
In [10]: out = f.require_dataset('/x', shape=(len(filenames),) + im.shape, dtype=im.dtype)

Great, now we can insert our images one at a time into the HDF5 dataset.

In [11]: for i, fn in enumerate(filenames):
   ....:     im = imread(fn)
   ....:     out[i, :, :, :] = im
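
For a dataset like the one in the question, the same loop can write into a 4D dataset directly. This is only a sketch under stated assumptions: stack_to_hdf5, its parameters and the dataset name 'x' are hypothetical names introduced here, the files are assumed to be ordered time-major (all z-slices of the first time point, then the second, and so on), and .npy files read with np.load stand in for TIFF slices so the example runs without image files. Only one slice is held in memory at a time.

```python
import os
import tempfile

import h5py
import numpy as np


def stack_to_hdf5(filenames, path, n_z, read_slice, name='x'):
    """Copy 2D slices into a 4D (t, z, y, x) dataset, one slice at a time."""
    if len(filenames) % n_z:
        raise ValueError('number of files is not a multiple of n_z')
    sample = read_slice(filenames[0])          # only used for shape and dtype
    shape = (len(filenames) // n_z, n_z) + sample.shape
    with h5py.File(path, 'w') as f:            # 'w' overwrites an existing file
        # One HDF5 chunk per 2D slice, matching how the data is written.
        out = f.create_dataset(name, shape=shape, dtype=sample.dtype,
                               chunks=(1, 1) + sample.shape)
        for i, fn in enumerate(filenames):
            t, z = divmod(i, n_z)              # files assumed ordered t-major
            out[t, z] = read_slice(fn)         # one slice in memory at a time
    return shape


# Toy demonstration with .npy files standing in for TIFF slices (assumption).
folder = tempfile.mkdtemp()
filenames = []
for i in range(6):
    fn = os.path.join(folder, 'slice.%03d.npy' % i)
    np.save(fn, np.full((4, 5), i, dtype='uint8'))
    filenames.append(fn)

h5_path = os.path.join(folder, 'movie.hdf5')
shape = stack_to_hdf5(filenames, h5_path, n_z=3, read_slice=np.load)

with h5py.File(h5_path, 'r') as f:
    value = int(f['x'][1, 2, 0, 0])            # slice number 1 * 3 + 2 = 5
```

For real TIFF files you would pass skimage.io.imread as read_slice. One chunk per slice is a simple default; a different chunk shape may suit other access patterns better.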

At this point dask.array can happily wrap the out dataset:

In [12]: import dask.array as da
In [13]: x = da.from_array(out, chunks=(1, 5, 5, 3))  # treat each image as a single chunk
In [14]: x[::2, :, :, 0].mean()
Out[14]: dask.array<x_3, shape=(), chunks=(), dtype=float64>
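
Note that the result above is still a lazy dask array; calling .compute() on it produces the actual number. Here is a small self-contained sketch of the full round trip. The file name demo.hdf5 and the synthetic data are assumptions made for illustration, and the computation is done while the file is still open because the dask array reads from it on demand.

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

# Write a small synthetic (4, 5, 5, 3) dataset to a temporary HDF5 file.
path = os.path.join(tempfile.mkdtemp(), 'demo.hdf5')
data = np.arange(4 * 5 * 5 * 3, dtype='float32').reshape(4, 5, 5, 3)
with h5py.File(path, 'w') as f:
    f.create_dataset('x', data=data)

# Reopen read-only, wrap the dataset, and evaluate the lazy expression.
with h5py.File(path, 'r') as f:
    x = da.from_array(f['x'], chunks=(1, 5, 5, 3))  # one image per chunk
    lazy = x[::2, :, :, 0].mean()                   # builds a task graph only
    result = lazy.compute()                         # reads from disk and reduces

expected = data[::2, :, :, 0].mean()                # in-memory reference value
```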

If you'd like to see more native support for stacks of images then I encourage you to raise an issue. It would be pretty easy to use dask.array off of your stack of tiff files directly without going through HDF5.

Tags: python h5py dask