TensorFlow memory leak when building graph in a loop

Published 2019-04-11 09:43

Question:

I noticed this when my grid search for selecting the hyper-parameters of a TensorFlow (version 1.12.0) model crashed because of exploding memory consumption.

Note that, unlike the similar-looking question here, I do close the graph and the session (using context managers), and I am not adding nodes to the graph inside the loop.

I suspected that TensorFlow might keep global variables that do not get cleared between iterations, so I compared globals() before and after an iteration, but did not observe any difference in the set of global variables.
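For reference, a check of that kind can be done by diffing the set of names in globals() around a single throwaway iteration; the snippet below is only an illustration of the idea, not code from my project:

import tensorflow as tf

# Illustration only: diff the module-level names around one throwaway iteration.
before = set(globals())
with tf.Graph().as_default():
    loss = tf.constant(0.0)
    with tf.Session() as sess:
        sess.run(loss)
after = set(globals())

# Typically only the Python bindings created above (loss, sess, before) appear
# here; no hidden TensorFlow globals show up in the user's namespace.
print(after - before)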

I made a small example that reproduces the problem. I train a simple MNIST classifier in a loop and plot the memory consumed by the process:

import matplotlib.pyplot as plt
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import psutil
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
process = psutil.Process(os.getpid())

N_REPS = 100
N_ITER = 10
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x_test, y_test = mnist.test.images, mnist.test.labels

# Runs experiment several times.
mem = []
for i in range(N_REPS):
    with tf.Graph().as_default():
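        # Note: x_test is a NumPy array; building the layer converts it into a constant stored in the new graph.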
        net = tf.contrib.layers.fully_connected(x_test, 200)
        logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_test, logits=logits))
        train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
        init = tf.global_variables_initializer()
        with tf.Session() as sess:
            # training loop.
            sess.run(init)
            for _ in range(N_ITER):
                sess.run(train_op)
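    # Record the process's resident memory (RSS) after this repetition.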
    mem.append(process.memory_info().rss)
plt.plot(range(N_REPS), mem)

The resulting plot shows process memory climbing steadily across repetitions.

In my actual project, the process memory starts at a couple of hundred MB (depending on the dataset size) and grows until it reaches 64 GB and my system runs out of memory. There are things I tried that slow the increase down, such as using placeholders and feed_dicts instead of relying on convert_to_tensor, but the constant increase is still there, only slower.
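For completeness, the placeholder/feed_dict variant of the loop body looks roughly like this (a sketch reusing x_test, y_test and N_ITER from the snippet above; the names x_ph and y_ph are illustrative, not the exact code from my project):

# Rough sketch of the placeholder/feed_dict variant: the test arrays are fed
# at run time instead of being baked into the graph as constants.
with tf.Graph().as_default():
    x_ph = tf.placeholder(tf.float32, shape=[None, 784])
    y_ph = tf.placeholder(tf.float32, shape=[None, 10])
    net = tf.contrib.layers.fully_connected(x_ph, 200)
    logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
    train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(N_ITER):
            sess.run(train_op, feed_dict={x_ph: x_test, y_ph: y_test})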

Answer 1:

You need to clear the graph after each iteration of your for loop before instantiating a new graph. Adding tf.reset_default_graph() at the end of your for loop should resolve your memory leak issue.

for i in range(N_REPS):
    with tf.Graph().as_default():
        net = tf.contrib.layers.fully_connected(x_test, 200)
        ...
    mem.append(process.memory_info().rss)
    tf.reset_default_graph()