I have 27GB of 2D tiff files that represent slices of a movie of 3D images. I want to be able to slice this data as if it were a simple 4D numpy array. It looks like dask.array is a good tool for cleanly manipulating the array once it's stored as an HDF5 file.
How can I store these files as an HDF5 file in the first place if they do not all fit into memory? I am new to h5py and databases in general.
Thanks.
Edit: Use dask.array's imread function
As of dask 0.7.0 you don't need to store your images in HDF5. Use the imread function directly instead:
In [1]: from skimage.io import imread
In [2]: im = imread('foo.1.tiff')
In [3]: im.shape
Out[3]: (5, 5, 3)
In [4]: ls foo.*.tiff
foo.1.tiff foo.2.tiff foo.3.tiff foo.4.tiff
In [5]: from dask.array.image import imread
In [6]: im = imread('foo.*.tiff')
In [7]: im.shape
Out[7]: (4, 5, 5, 3)
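The stack returned by dask.array.image.imread is lazy, with one chunk per file, and the function takes an optional imread keyword for plugging in a different reader. Here is a minimal sketch of that behaviour. It assumes tiny .npy files as stand-ins for the tiff files, with np.load as the reader, so that it runs without any image library installed. The file names and the temporary directory are made up for illustration.

```python
import os
import tempfile

import numpy as np
from dask.array.image import imread

# Write four tiny stand-in "images" (frame i is filled with the value i).
tmpdir = tempfile.mkdtemp()
for i in range(4):
    frame = np.full((5, 5, 3), i, dtype='uint8')
    np.save(os.path.join(tmpdir, 'foo.%d.npy' % i), frame)

# Pass a custom reader through the `imread` keyword; np.load replaces the
# default skimage reader here only because the stand-in files are .npy.
stack = imread(os.path.join(tmpdir, 'foo.*.npy'), imread=np.load)

stack_shape = stack.shape      # one leading axis entry per file
chunk_layout = stack.chunks    # each file is its own chunk

# Nothing is read from disk until .compute() is called.
last_frame_max = int(stack[-1].max().compute())
```

Files are stacked in sorted filename order, so with more than nine files zero-padded numbers (foo.01.tiff, foo.02.tiff, ...) are the safer naming scheme.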
Older answer that stores images into HDF5
Data ingest is often the trickiest of problems. Dask.array doesn't have any automatic integration with image files (though this is quite doable if there's sufficient interest). Fortunately, moving data to h5py is easy because h5py supports the numpy slicing syntax. In the following example we'll create an empty h5py Dataset and then store four tiny tiff files into that dataset in a for loop.
First we get filenames for our images (please forgive the toy dataset; I don't have anything realistic lying around):
In [1]: from glob import glob
In [2]: filenames = sorted(glob('foo.*.tiff'))
In [3]: filenames
Out[3]: ['foo.1.tiff', 'foo.2.tiff', 'foo.3.tiff', 'foo.4.tiff']
Load in and inspect a sample image
In [4]: from skimage.io import imread
In [5]: im = imread(filenames[0]) # a sample image
In [6]: im.shape # tiny image
Out[6]: (5, 5, 3)
In [7]: im.dtype
Out[7]: dtype('int8')
Now we'll make an HDF5 file and an HDF5 dataset called '/x' within that file.
In [8]: import h5py
In [9]: f = h5py.File('myfile.hdf5', 'a') # make an hdf5 file ('a' creates it if missing; h5py 3.x defaults to read-only)
In [10]: out = f.require_dataset('/x', shape=(len(filenames), 5, 5, 3), dtype=im.dtype)
Great, now we can insert our images one at a time into the HDF5 dataset.
In [11]: for i, fn in enumerate(filenames):
   ....:     im = imread(fn)
   ....:     out[i, :, :, :] = im
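Since the question is about 2D slices that make up a movie of 3D volumes, the same loop can write straight into a 4D dataset. The following is only a sketch under stated assumptions: the files are ordered time-major (all z-slices of time 0, then all of time 1, and so on), the sizes n_t and n_z are known in advance, and load_slice is a hypothetical stand-in for imread(filenames[i]) that generates data instead of reading tiff files.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout: 2 time points x 3 z-slices, each slice 5x5 pixels.
n_t, n_z, height, width = 2, 3, 5, 5


def load_slice(i):
    """Stand-in for imread(filenames[i]): a slice filled with the value i."""
    return np.full((height, width), i, dtype='uint8')


path = os.path.join(tempfile.mkdtemp(), 'movie.hdf5')

with h5py.File(path, 'w') as f:
    # One HDF5 chunk per 2D slice, so each write touches exactly one chunk.
    out = f.create_dataset('/x', shape=(n_t, n_z, height, width),
                           dtype='uint8', chunks=(1, 1, height, width))
    for i in range(n_t * n_z):
        t, z = divmod(i, n_z)          # flat file index -> (time, depth)
        out[t, z, :, :] = load_slice(i)  # only one slice is in memory at a time

# Read back a single 3D volume without loading the whole movie.
with h5py.File(path, 'r') as f:
    volume = f['/x'][1]                # all z-slices at time point 1
    first_values = volume[:, 0, 0].tolist()
```

Only one slice is held in memory per iteration, which is what makes this workable when the full dataset is larger than RAM.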
At this point dask.array can happily wrap the out dataset:
In [12]: import dask.array as da
In [13]: x = da.from_array(out, chunks=(1, 5, 5, 3)) # treat each image as a single chunk
In [14]: x[::2, :, :, 0].mean()
Out[14]: dask.array<x_3, shape=(), chunks=(), dtype=float64>
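Note that Out[14] is still a dask array: the expression only describes the work, and calling .compute() is what produces a number. A small sketch of that step, assuming an in-memory numpy array as a stand-in for the HDF5 dataset (da.from_array accepts both):

```python
import dask.array as da
import numpy as np

# Stand-in for the HDF5 dataset: four 5x5x3 images, image i filled with i.
data = np.stack([np.full((5, 5, 3), i, dtype='uint8') for i in range(4)])

x = da.from_array(data, chunks=(1, 5, 5, 3))  # treat each image as one chunk

lazy_mean = x[::2, :, :, 0].mean()   # lazy: builds a task graph, reads nothing
result = float(lazy_mean.compute())  # runs the graph chunk by chunk
```

Here images 0 and 2 are selected, so the mean of their first channel is 1.0.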
If you'd like to see more native support for stacks of images then I encourage you to raise an issue. It would be pretty easy to use dask.array off of your stack of tiff files directly without going through HDF5.