Memory error when using Keras ImageDataGenerator

Published 2020-07-03 06:23

Question:

I am attempting to predict features in imagery using Keras with a TensorFlow backend. Specifically, I am attempting to use a Keras ImageDataGenerator. The model is set to run for 4 epochs and runs fine until the 4th epoch, where it fails with a MemoryError.

I am running this model on an AWS g2.2xlarge instance running Ubuntu Server 16.04 LTS (HVM), SSD Volume Type.

The training images are 256x256 RGB pixel tiles (8 bit unsigned) and the training mask is 256x256 single band (8 bit unsigned) tiled data where 255 == a feature of interest and 0 == everything else.

The following 3 functions are the ones pertinent to this error.

How can I resolve this MemoryError?


def train_model():
        batch_size = 1
        training_imgs = np.lib.format.open_memmap(filename=os.path.join(data_path, 'data.npy'),mode='r+')
        training_masks = np.lib.format.open_memmap(filename=os.path.join(data_path, 'mask.npy'),mode='r+')
        dl_model = create_model()
        print(dl_model.summary())
        model_checkpoint = ModelCheckpoint(os.path.join(data_path,'mod_weight.hdf5'), monitor='loss',verbose=1, save_best_only=True)
        dl_model.fit_generator(generator(training_imgs, training_masks, batch_size), steps_per_epoch=(len(training_imgs)/batch_size), epochs=4,verbose=1,callbacks=[model_checkpoint])

def generator(train_imgs, train_masks=None, batch_size=None):

        # Create empty arrays to contain a batch of features and labels

        if train_masks is not None:
                train_imgs_batch = np.zeros((batch_size,y_to_res,x_to_res,bands))
                train_masks_batch = np.zeros((batch_size,y_to_res,x_to_res,1))

                while True:
                        for i in range(batch_size):
                                # choose random index in features
                                index= random.choice(range(len(train_imgs)))
                                train_imgs_batch[i] = train_imgs[index]
                                train_masks_batch[i] = train_masks[index]
                        yield train_imgs_batch, train_masks_batch
        else:
                rec_imgs_batch = np.zeros((batch_size,y_to_res,x_to_res,bands))
                while True:
                        for i in range(batch_size):
                                # choose random index in features
                                index= random.choice(range(len(train_imgs)))
                                rec_imgs_batch[i] = train_imgs[index]
                        yield rec_imgs_batch

def train_generator(train_images,train_masks,batch_size):
        data_gen_args=dict(rotation_range=90.,horizontal_flip=True,vertical_flip=True,rescale=1./255)
        image_datagen = ImageDataGenerator()
        mask_datagen = ImageDataGenerator()
        # Provide the same seed and keyword arguments to the fit and flow methods
        seed = 1
        image_datagen.fit(train_images, augment=True, seed=seed)
        mask_datagen.fit(train_masks, augment=True, seed=seed)
        image_generator = image_datagen.flow(train_images,batch_size=batch_size)
        mask_generator = mask_datagen.flow(train_masks,batch_size=batch_size)
        return zip(image_generator, mask_generator)

The following is the output from the model, detailing the epochs and the error message:

Epoch 00001: loss improved from inf to 0.01683, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 2/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0027 - jaccard_coef_int: 0.9983  

Epoch 00002: loss improved from 0.01683 to 0.00492, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 3/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0026 - jaccard_coef_int: 0.9982  

Epoch 00003: loss improved from 0.00492 to 0.00488, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 4/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0074 - binary_crossentropy: 0.0042 - jaccard_coef_int: 0.9975  

Epoch 00004: loss did not improve
Traceback (most recent call last):
  File "image_rec.py", line 291, in <module>
    train_model()
  File "image_rec.py", line 208, in train_model
    dl_model.fit_generator(train_generator(training_imgs,training_masks,batch_size),steps_per_epoch=1,epochs=1,workers=1)
  File "image_rec.py", line 274, in train_generator
    image_datagen.fit(train_images, augment=True, seed=seed)
  File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/keras/preprocessing/image.py", line 753, in fit
    x = np.copy(x)
  File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1505, in copy
    return array(a, order=order, copy=True)
MemoryError

Answer 1:

It seems your problem is that the data is too large. I can see two solutions. The first is to run your code on a distributed system, for example with Spark; I guess you do not have that infrastructure, so let us move on to the other.

The second, which I think is viable, is to slice the data and feed the model incrementally. We can do this with Dask. This library can slice the data and store it as objects which you can then read back from disk, only the part you want.

If you have an image that is a 100x100 matrix, we can retrieve each piece without needing to load all 100 arrays into memory. We can load the arrays into memory one by one (releasing the previous one), and each would be the input to your neural network.

To do this, you can transform your np.array into a dask array and assign the partitions. For example:

>>> k = np.random.randn(10,10) # Matrix 10x10
>>> import dask.array as da
>>> k2 = da.from_array(k,chunks = 3)
dask.array<array, shape=(10, 10), dtype=float64, chunksize=(3, 3)>
>>> k2.to_delayed()
array([[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 3))]],
  dtype=object)

Here you can see how the data is stored as (delayed) objects, which you can then retrieve in parts to feed your model.

To implement this solution you must introduce a loop into your function which calls each partition and feeds the NN to get incremental training.
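A minimal sketch of such a loop is below, assuming the images and masks have already been wrapped as dask arrays; the names imgs_da and masks_da and the chunking shown are illustrative, not taken from the question:

import dask.array as da

def dask_block_generator(imgs_da, masks_da):
    # Yield one chunk of images/masks at a time, so only a single
    # block is materialised in memory per step.
    img_blocks = imgs_da.to_delayed().ravel()
    mask_blocks = masks_da.to_delayed().ravel()
    while True:  # Keras generators are expected to loop forever
        for img_block, mask_block in zip(img_blocks, mask_blocks):
            # .compute() loads just this block; the previous one can be freed
            yield img_block.compute(), mask_block.compute()

# Usage sketch: chunk along the sample axis only, keeping each tile whole
# imgs_da = da.from_array(training_imgs, chunks=(batch_size, 256, 256, 3))
# masks_da = da.from_array(training_masks, chunks=(batch_size, 256, 256, 1))
# dl_model.fit_generator(dask_block_generator(imgs_da, masks_da),
#                        steps_per_epoch=imgs_da.numblocks[0], epochs=4)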

For more information, see the Dask documentation.



Answer 2:

You provided quite confusing code (in my opinion); i.e., no call to train_generator is visible. I am not sure that this is a problem of insufficient memory due to big data, since you use a memmap for that, but let's assume for now that it is.

  • If the data is quite big, and since you're loading the images from a directory anyway, it might be worth considering ImageDataGenerator's flow_from_directory method. It would require a slight change of design, though, which might not be what you want.

You can load it in the following manner:

train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(256, 256),
        batch_size=batch_size,
        ...  # other configurations)

More on that in the Keras documentation.

  • Also note that if you are on 32-bit, a memmap does not allow more than 2GB.

  • Do you use tensorflow-gpu, by any chance? Maybe your GPU is not sufficient; you could try this with the plain tensorflow package (a quick device check is sketched below).
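As a quick way to check which devices TensorFlow actually registers (a sketch for TF 1.x; device_lib lives under the technically-internal tensorflow.python namespace but is commonly used for this):

from tensorflow.python.client import device_lib

# A working tensorflow-gpu install should list a GPU device here,
# together with the memory limit TensorFlow was able to grab.
for dev in device_lib.list_local_devices():
    print("%s  %s  %d MB" % (dev.name, dev.device_type, dev.memory_limit // 2**20))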

I would strongly suggest trying some memory profiling to see where the larger allocations of memory happen.
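For example, the memory_profiler package (pip install memory_profiler) reports memory usage line by line for any function decorated with @profile; the toy function below is only an illustration of how a dtype-widening copy blows up memory, not the question's code:

from memory_profiler import profile
import numpy as np

@profile
def suspect_function():
    tiles = np.zeros((100, 256, 256, 3), dtype=np.uint8)  # ~19 MB of uint8 tiles
    as_float = tiles.astype(np.float64)                   # ~150 MB copy
    return as_float

if __name__ == '__main__':
    suspect_function()  # the line-by-line memory report is printed when it returns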


If it is not a case of insufficient memory, it might be wrong handling of the data in your model; since your loss function is not improving at all, it could be miswired, for example.


Finally, one last note here: it is good practice to load the memmap of the training data as read-only, since you don't want to accidentally mess up the data.
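Concretely, that means opening the arrays with mode='r' instead of 'r+' in train_model (a sketch of the change; data_path is the question's own variable):

import os
import numpy as np

# Same calls as in train_model(), but the memmaps are opened read-only
training_imgs = np.lib.format.open_memmap(
    filename=os.path.join(data_path, 'data.npy'), mode='r')
training_masks = np.lib.format.open_memmap(
    filename=os.path.join(data_path, 'mask.npy'), mode='r')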

UPDATE: I can see that you've updated the post and provided the code for the train_generator method, but there is still no call to that method in your code.

If I assume that you have a typo in the call - train_generator instead of the generator method in your dl_model.fit_generator call - it is possible that fit_generator is not working on a batch of data but actually on the whole training_imgs array, and that it copies the whole set in the np.copy(x) call.

Also, as mentioned already, there are indeed a few known issues with Keras memory leaks when using the fit and fit_generator methods (you can find some of them; e.g., here is an open one).



Answer 3:

This is common when running 32-bit if the float precision is too high. Are you running 32-bit? You may also consider casting or rounding the array to a lower-precision dtype.
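As an illustration of how much the dtype matters (a standalone sketch, not the question's data), down-casting from the default float64 to float32 halves the memory held by the array:

import numpy as np

imgs = np.random.rand(100, 256, 256, 3)   # float64 by default, ~150 MB
imgs32 = imgs.astype(np.float32)          # same values, half the memory
print("%d MB -> %d MB" % (imgs.nbytes // 2**20, imgs32.nbytes // 2**20))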



Answer 4:

Generally Keras/TensorFlow is very good with resource usage, but there is a known memory leak that has caused problems in the past. To make sure that's not the one causing your problems, try adding these two lines of code to your training script:

# load the backend
from keras import backend as K

# prevent Tensorflow memory leakage
K.clear_session()


Answer 5:

I ran into the same problem recently. Somehow the FCN-8 code could run successfully on my tensorflow 1.2 + keras 2.0.9 + 8GB RAM + 1060 computer, but a MemoryError occurred when using ModelCheckpoint on my tf 1.4 + keras 2.1.5 + 16GB RAM + 1080 Ti computer.