Training of keras model gets slower after eac

Posted 2020-08-13 01:54

Question:

I'm writing some code to optimize a neural net architecture, so I have a Python function create_nn(parms) that creates and initializes a Keras model. The problem is that after a few iterations the models take much longer to train than usual: initially one epoch takes 10 sec, and then after roughly the 14th model (each model trains for 20 epochs) it takes 60 sec/epoch. I know this is not caused by the evolving architecture, because if I restart the script and resume where it left off, training is back to normal speed.

I'm currently running

from keras import backend as K

and then a

K.clear_session()

after training any given new model.

Some additional details:

  • For the first 12 models, training time per epoch remains roughly constant at 10sec/epoch. Then at the 13th model training time per epoch climbs steadily to 60sec. Then training time per epoch hovers at around 60sec/epoch.

  • I'm running keras with Tensorflow as the backend

  • I'm using an Amazon EC2 t2.xlarge instance

  • There is plenty of free RAM (7GB free, w/ the dataset of size 5GB)

I've removed a bunch of layers and parameters, but essentially create_nn looks like:

from keras.layers import (Activation, BatchNormalization, Convolution1D, Dense,
                          Dropout, Flatten, GaussianNoise, Input)
from keras.models import Model

def create_nn(features, timesteps, number_of_filters):
    inputs = Input(shape=(timesteps, features))
    x = GaussianNoise(stddev=0.005)(inputs)
    # Layer 1.1
    x = Convolution1D(number_of_filters, 3, padding='valid')(x)
    x = Activation('relu')(x)
    x = Flatten()(x)
    x = Dense(10)(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)
    # Output layer
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=outputs)

    # Compile and return
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    print('CNN model built successfully.')
    return model

Note that while a Sequential model would have worked in this dummy example, the functional API is required for the actual use case.

How can I fix this problem?

Answer 1:

Why is my training time increasing after every run?

Short answer: you need to call tf.keras.backend.clear_session() before every new model that you create.

This problem only seems to happen when eager execution is turned off.
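A minimal sketch of how the short answer might be applied to a search loop like the one in the question. The names run_search, build_and_train, and param_grid are hypothetical and do not come from the original post; build_and_train stands in for whatever builds, fits, and scores one candidate model. The reset function is passed in as an argument so the loop can be exercised without TensorFlow; when omitted, it falls back to tf.keras.backend.clear_session.

```python
import gc
import time


def run_search(param_grid, build_and_train, clear_session=None):
    """Train one candidate per parameter set, resetting Keras state first.

    `build_and_train` is a hypothetical callable that builds, fits and
    scores a single model; it is not part of the original post.
    """
    if clear_session is None:
        # Imported lazily so the loop itself does not require TensorFlow.
        from tensorflow.keras.backend import clear_session

    results = []
    for params in param_grid:
        clear_session()  # reset global Keras state BEFORE building the next model
        gc.collect()     # assumption: optionally collect objects left from the previous model
        start = time.perf_counter()
        score = build_and_train(**params)
        results.append((params, score, time.perf_counter() - start))
    return results
```

With the asker's code, build_and_train would call create_nn(...), fit the model, and return a validation metric. Recording the elapsed time per candidate makes any remaining slowdown easy to spot.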

Okay, so let's run an experiment with and without clear_session. The code for make_model is at the end of this response.

First, let's look at the training time when using clear_session. We'll run this experiment 10 times and print the results.

Use tf.keras.backend.clear_session()

non_seq_time = [ make_model(clear_session=True) for _ in range(10)]

With clear_session=True

non sequential
Elapse =  1.06039
Elapse =  1.20795
Elapse =  1.04357
Elapse =  1.03374
Elapse =  1.02445
Elapse =  1.00673
Elapse =  1.01712
Elapse =    1.021
Elapse =  1.17026
Elapse =  1.04961

As you can see, the training time stays roughly constant.

Now let's re-run the experiment without clear_session and review the training time.

Don't use tf.keras.backend.clear_session()

non_seq_time = [ make_model(clear_session=False) for _ in range(10)]

With clear_session=False

non sequential
Elapse =  1.10954
Elapse =  1.13042
Elapse =  1.12863
Elapse =   1.1772
Elapse =   1.2013
Elapse =  1.31054
Elapse =  1.27734
Elapse =  1.32465
Elapse =  1.32387
Elapse =  1.33252

As you can see, the training time increases from run to run without clear_session.
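To put a number on the trend, one option is to fit a straight line to each list of timings. The sketch below uses the ten measurements printed above; slope_per_run is a helper name introduced here for illustration, and an ordinary least-squares slope is only one reasonable way to summarize the drift.

```python
def slope_per_run(times):
    """Least-squares slope of elapsed time against run index (seconds per run)."""
    n = len(times)
    mean_x = (n - 1) / 2
    mean_y = sum(times) / n
    cov = sum((i - mean_x) * (t - mean_y) for i, t in enumerate(times))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Timings copied from the two experiments above.
with_clear = [1.06039, 1.20795, 1.04357, 1.03374, 1.02445,
              1.00673, 1.01712, 1.021, 1.17026, 1.04961]
without_clear = [1.10954, 1.13042, 1.12863, 1.1772, 1.2013,
                 1.31054, 1.27734, 1.32465, 1.32387, 1.33252]

print(f"with clear_session:    {slope_per_run(with_clear):+.4f} s/run")
print(f"without clear_session: {slope_per_run(without_clear):+.4f} s/run")
```

For these numbers the slope without clear_session comes out at roughly +0.03 s per run, while the slope with it is close to zero.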

Full Code Example

# Training time increases - and how to fix it

# Setup and imports

# %tensorflow_version 2.x

import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
from time import time

# If you comment this out, the problem doesn't happen:
# it only happens when eager execution is disabled.
tf.compat.v1.disable_eager_execution()


(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()


# Let's build that network
def make_model(activation="relu", hidden=2, units=100, clear_session=False):
    # -----------------------------------
    # HERE WE CAN TOGGLE CLEAR SESSION
    # -----------------------------------
    if clear_session:
        tf.keras.backend.clear_session()

    start = time()
    inputs = layers.Input(shape=[784])
    x = inputs

    for _ in range(hidden):
        x = layers.Dense(units=units, activation=activation)(x)

    outputs = layers.Dense(units=10, activation="softmax")(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    results = model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=200, verbose=0)
    elapse = time()-start
    print(f"Elapse = {elapse:8.6}")
    return elapse

# Let's try it out and time it

# prime it first
make_model()

print("Use clear session")
non_seq_time = [ make_model(clear_session=True) for _ in range(10)]

print("Don't use clear session")
non_seq_time = [ make_model(clear_session=False) for _ in range(10)]