I'm using TensorFlow with Titan-X GPUs and I've noticed that, when I run the CIFAR10 example, the volatile GPU utilization is fairly constant at around 30%, whereas when I train my own model, the volatile GPU utilization is far from steady: it is almost always at 0%, then spikes to 80-90% before dropping back to 0%, over and over again.
I thought that this behavior was due to the way I was feeding the data to the network (I was fetching the data after each step, which took some time). But after implementing a queue to feed the data and avoid this latency between steps, the problem persisted (see the queueing code below).

Any ideas?
import threading
import tensorflow as tf

batch = 128  # size of the batch

x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

# With a capacity of 100 batches, the bottleneck should not be the data feeding.
queue = tf.RandomShuffleQueue(capacity=100*batch,
                              min_after_dequeue=80*batch,
                              dtypes=[tf.float32, tf.float32],
                              shapes=[[n_steps, n_input], [n_classes]])
enqueue_op = queue.enqueue_many([x, y])
X_batch, Y_batch = queue.dequeue_many(batch)

sess = tf.Session()

def load_and_enqueue(data):
    # Producer thread: keeps pushing fresh batches into the queue.
    while True:
        X, Y = data.get_next_batch(batch)
        sess.run(enqueue_op, feed_dict={x: X, y: Y})

train_thread = threading.Thread(target=load_and_enqueue, args=(data,))
train_thread.daemon = True
train_thread.start()

for _ in range(max_iter):
    sess.run(train_op)
After doing some experiments, I found the answer, so I am posting it here since it could be useful to someone else.
First, get_next_batch is approximately 15x slower than train_op (thanks to Eric Platon for pointing this out).
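One way to check this kind of imbalance is to time the two operations separately. A rough sketch is below; it reuses data, batch, sess and train_op from the snippet above and assumes the producer thread has already filled the queue to capacity before the second loop starts, so that the dequeue does not block:

import time

n_trials = 20

# Time the data loading alone.
start = time.time()
for _ in range(n_trials):
    X, Y = data.get_next_batch(batch)
fetch_time = (time.time() - start) / n_trials

# Time a bare training step. With capacity=100*batch and
# min_after_dequeue=80*batch, 20 steps can run back to back without the
# dequeue blocking, provided the queue starts out full.
start = time.time()
for _ in range(n_trials):
    sess.run(train_op)
step_time = (time.time() - start) / n_trials

print("get_next_batch: %.4f s, train_op: %.4f s, ratio: %.1fx"
      % (fetch_time, step_time, fetch_time / step_time))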
However, I thought that the queue was being fed up to capacity and that only then was the training supposed to begin. Hence, I thought that even if get_next_batch was way slower, the queue would hide this latency, at least at the beginning, since it holds capacity examples and would need to fetch new data only after it drops to min_after_dequeue, which is lower than capacity, and that this would result in a somewhat steady GPU utilization.
But actually, the training begins as soon as the queue holds min_after_dequeue examples. Thus the queue starts being dequeued as soon as it reaches min_after_dequeue examples in order to run the train_op, and since feeding the queue is 15x slower than executing the train_op, the number of elements in the queue drops below min_after_dequeue right after the first iteration of the train_op, and the train_op then has to wait for the queue to reach min_after_dequeue examples again.
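This is easy to observe by logging the queue size during training. A small sketch (queue, sess, max_iter and train_op as in the question) shows the count hovering around min_after_dequeue instead of staying near capacity:

# Op returning the current number of elements in the queue.
queue_size_op = queue.size()

for step in range(max_iter):
    sess.run(train_op)
    if step % 10 == 0:
        # With a slow producer, this stays stuck around min_after_dequeue.
        print("step %d, queue size: %d" % (step, sess.run(queue_size_op)))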
When I force the train_op to wait until the queue is fed up to capacity (with capacity = 100*batch) instead of starting automatically once it reaches min_after_dequeue (with min_after_dequeue = 80*batch), the GPU utilization is steady for about 10 seconds before going back to 0%, which is understandable since the queue drains back down to min_after_dequeue examples in less than 10 seconds.
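For reference, the "wait until the queue is full" behaviour described above can be reproduced by polling the queue size before entering the training loop. A minimal sketch, with the same queue, sess, max_iter and train_op as before:

import time

queue_size_op = queue.size()

# Block the consumer until the producer thread has filled the queue
# up to capacity (100*batch) instead of starting at min_after_dequeue.
while sess.run(queue_size_op) < 100 * batch:
    time.sleep(1)

for _ in range(max_iter):
    sess.run(train_op)

Of course this only delays the stall: once the buffered examples are consumed, the producer is still the bottleneck, which is exactly what the GPU utilization shows.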