Is it possible to np.concatenate memory-mapped files?

Posted 2019-01-26 07:59

Question:

I saved a couple of numpy arrays with np.save(), and taken together they're quite large.

Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?

Answer 1:

Using numpy.concatenate loads the arrays into memory. To avoid this, you can create a third memmap array in a new file and copy into it the values from the arrays you wish to concatenate (a minimal sketch of this follows). More efficiently, you can also append new arrays to an already existing file on disk.
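
For example, here is a minimal sketch of the third-file approach (the file names and shapes are just placeholders, and it assumes a.array and b.array already exist on disk as raw float64 memmaps):

import numpy as np

# Open the existing arrays read-only through memory maps
a = np.memmap('a.array', dtype='float64', mode='r', shape=(5000, 1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000, 1000))

# Create a new file big enough to hold both arrays stacked along axis 0
c = np.memmap('c.array', dtype='float64', mode='w+', shape=(20000, 1000))
c[:5000, :] = a   # the OS pages data in and out; the full arrays need not fit in RAM
c[5000:, :] = b
c.flush()         # make sure everything is written to disk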

In either case you must choose the right memory order for the array (row-major or column-major).

The following examples illustrate how to concatenate along axis 0 and axis 1.


1) concatenate along axis=0

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222

You can define a third array that reads the same file as the first array to be concatenated (here a) in mode 'r+' (read/write; numpy.memmap grows the file on disk when the requested shape is larger than the existing file), but with the shape of the final array you want after concatenation:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b

Concatenating along axis=0 does not require passing order='C', because that is already the default order.
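
To sanity-check the result, you can reopen the grown file read-only with the final shape (assuming the example above was run):

check = np.memmap('a.array', dtype='float64', mode='r', shape=(20000,1000))
print(check[0, 0], check[4999, 0])    # 111.0 111.0 -> original contents of a
print(check[5000, 0], check[-1, 0])   # 222.0 222.0 -> appended contents of b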


2) concatenate along axis=1

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222

The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. You can easily avoid this by passing order='F' (column-major) to memmap:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000,4000), order='F')
c[:, 3000:] = b

You now have an updated file 'a.array' containing the concatenation result. You can repeat this process to concatenate more arrays, two at a time.
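
Note that the file now holds the data in the column-major layout of the (5000, 4000) view, so any later np.memmap call on 'a.array' must pass the same shape and order='F' to see the values where you expect them. A small reload check, assuming the example above was run:

c2 = np.memmap('a.array', dtype='float64', mode='r', shape=(5000,4000), order='F')
print(c2[0, 0], c2[0, 3000])   # 111.0 222.0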

Related questions:

  • Working with big data in python and numpy, not enough ram, how to save partial results on disc?


Answer 2:

If you use order='F', it leads to another problem: when you load the file the next time, it will be quite a mess even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.

import numpy as np

fp = ...                 # your old memmap (already on disk)
shape = fp.shape
data = ...               # your ndarray to append along the last axis
data_shape = data.shape
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
# 'w+' creates the new file; use 'r+' only if new_file_name already exists with this size
new_fp = np.memmap(new_file_name, dtype='float32', mode='w+', shape=concat_shape)
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
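
For illustration, a tiny self-contained run of the same pattern, with made-up file names and small shapes:

import numpy as np

# Hypothetical setup: an existing on-disk memmap and an in-memory array to append
fp = np.memmap('old.array', dtype='float32', mode='w+', shape=(4, 3))
fp[:] = 1.0
data = np.full((4, 2), 2.0, dtype='float32')

# Concatenate along the last axis into a new file, as in the snippet above
concat_shape = data.shape[:-1] + (data.shape[-1] + fp.shape[-1],)
new_fp = np.memmap('new.array', dtype='float32', mode='w+', shape=concat_shape)
new_fp[:, :fp.shape[-1]] = fp[:]
new_fp[:, fp.shape[-1]:] = data[:]
new_fp.flush()
print(new_fp)   # first 3 columns are 1.0, last 2 columns are 2.0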