Chain datasets from multiple HDF5 files/datasets

Posted 2019-04-12 00:16

Question:

The benefits and the simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the results in datasets, one per file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns but different numbers of rows, i.e., (A,N), (B,N), (C,N), etc.

I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).

For this purpose, h5py.Link classes do not help, as they work at the level of HDF5 nodes.

Here is some pseudocode:

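import numpy as np
import h5py

# Datasets are created through a file; an in-memory file avoids touching disk
f = h5py.File('example.h5', 'w', driver='core', backing_store=False)
a = f.create_dataset('a', data=np.random.random((100, 50)))
b = f.create_dataset('b', data=np.random.random((300, 50)))
c = f.create_dataset('c', data=np.random.random((253, 50)))

# I want to view these arrays as a single array, stacked along the rows
# (magic_array_linker is the hypothetical function I am looking for)
combined = magic_array_linker([a, b, c], axis=0)
assert combined.shape == (100+300+253, 50)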

For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this at the numpy level, but I have not found any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.

Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying, starting from h5py.Dataset objects?

Answer 1:

First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.
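To make the copying point concrete, here is a small check using plain numpy arrays (the array names and shapes are only illustrative). Basic slicing gives a view that shares memory with its source, whereas numpy.concatenate allocates a fresh buffer:

```python
import numpy as np

a = np.zeros((100, 50))
b = np.zeros((300, 50))

# Basic slicing returns a view that shares memory with the original
view = a[10:20]
assert np.shares_memory(view, a)

# Concatenation allocates a new buffer and copies both inputs into it
combined = np.concatenate([a, b], axis=0)
assert combined.shape == (400, 50)
assert not np.shares_memory(combined, a)
assert not np.shares_memory(combined, b)
```

A numpy array describes one memory buffer with fixed strides, so two separately allocated arrays cannot be exposed as a single ndarray without a copy.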

Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.

As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.

class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # We need to modify the indices, so make sure items is a list
        # (a bare integer index is wrapped in a list first)
        items = list(items) if isinstance(items, tuple) else [items]

        for item in items:
            if isinstance(item, slice):
                # item is a slice object
                raise ValueError('Slices not implemented')

        for ref in self.references:
            size = self.file[ref].shape[self.axis]

            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size
        else:
            # Ran past the last subarray without finding the index
            raise IndexError('Index out of range')

        return self.file[item_ref][tuple(items)]

Here's how you use it:

import numpy as np
import h5py

with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a',data=np.random.random((100, 50)))
    b = f.create_dataset('b',data=np.random.random((300, 50)))
    c = f.create_dataset('c',data=np.random.random((253, 50)))

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)

    for i, key in enumerate([a, b, c]):
        ref_dataset[i] = key.ref

with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    # Global row 104 falls in 'b' (rows 100-399) at local row 4,
    # so both lines print the same value
    print(foo[104, 4])
    print(f['b'][4, 4])

This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
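As a rough sketch of what slice support could look like, the class below handles step-1 row slices by reading only the overlapping piece of each subarray and concatenating those pieces. ChainedRows is a hypothetical name, not part of h5py or numpy, and it takes the subarrays directly rather than a dataset of references. It assumes each part exposes .shape and __getitem__, so it is demonstrated here with numpy arrays. The result of a slice is a copy, but only of the rows that were requested.

```python
import numpy as np


class ChainedRows:
    """Read-only wrapper over several 2D array-likes stacked along axis 0.

    Hypothetical helper for illustration. Each part only needs `.shape`
    and `__getitem__`.
    """

    def __init__(self, parts):
        self.parts = list(parts)
        sizes = [p.shape[0] for p in self.parts]
        # offsets[i] is the global row index at which part i starts
        self.offsets = np.concatenate(([0], np.cumsum(sizes)))
        self.shape = (int(self.offsets[-1]),) + tuple(self.parts[0].shape[1:])

    def __getitem__(self, key):
        # Split the key into the row index and any trailing indices
        rows, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())

        if isinstance(rows, slice):
            start, stop, step = rows.indices(self.shape[0])
            if step != 1:
                raise NotImplementedError("only step-1 slices are supported")
            chunks = []
            for part, lo in zip(self.parts, self.offsets[:-1]):
                # Clip the requested range to this part, in local coordinates
                a = max(start - lo, 0)
                b = min(stop - lo, part.shape[0])
                if a < b:
                    chunks.append(part[(slice(int(a), int(b)),) + rest])
            if not chunks:
                # Empty selection: return an empty read from the first part
                return self.parts[0][(slice(0, 0),) + rest]
            # Copies only the rows that were actually requested
            return np.concatenate(chunks, axis=0)

        row = int(rows)
        if row < 0:
            row += self.shape[0]
        if not 0 <= row < self.shape[0]:
            raise IndexError("row index out of range")
        # Locate the part whose row range contains this global row
        i = int(np.searchsorted(self.offsets, row, side="right")) - 1
        return self.parts[i][(row - int(self.offsets[i]),) + rest]


parts = [np.arange(0, 6).reshape(3, 2),
         np.arange(6, 16).reshape(5, 2),
         np.arange(16, 20).reshape(2, 2)]
stacked = ChainedRows(parts)
full = np.concatenate(parts)  # reference copy, used only for comparison
assert stacked.shape == (10, 2)
assert np.array_equal(stacked[2:9], full[2:9])
```

With the file from the example above, something like ChainedRows([f['a'], f['b'], f['c']]) should behave the same way, since h5py.Dataset supports the same basic indexing, though that has not been tested here.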

You might also be able to subclass numpy.ndarray to get all the usual methods.