We are running multi-GPU jobs in TensorFlow and evaluating a migration from the queue-based model (using the string_input_producer interface) to the new TensorFlow Dataset API. The latter appears to offer an easier way to switch between training and validation concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)
is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
                     lambda: val_iterator.get_next(),
                     lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []
for i in range(self.num_gpus):
    with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
            if i == validation_tower:
                images, labels = next_batch
                # Loss funcs snipped out
            else:
                images, labels = next_batch
                # Loss funcs snipped out
The get_dataset function builds a dataset, applies a map function, and sets the batch size. It also builds an iterator, but doesn't initialize it; the iterator is initialized before the session starts.
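For reference, get_dataset is roughly along these lines (a minimal sketch; the TFRecord input format and the parse_example map function are placeholders for our actual decoding code):

def get_dataset(files, batch_size, epochs):
    # Build the input pipeline: read records, decode them, repeat and batch.
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(parse_example)  # parse_example: placeholder for our decode/augment fn
    dataset = dataset.repeat(epochs)
    dataset = dataset.batch(batch_size)
    # Create the iterator here; its initializer is run later, before training starts.
    iterator = dataset.make_initializable_iterator()
    return dataset, iterator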
The is_validating boolean is supplied while the session is running: every few steps we pass is_validating=True through the feed_dict so that the validation dataset is used.
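Concretely, the run loop looks something like this (a sketch; train_op, val_loss, max_steps and val_interval stand in for our actual ops and constants):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Initialize both iterators once, before the training loop.
    sess.run([train_iterator.initializer, val_iterator.initializer])
    for step in range(max_steps):
        # Normal training step: tf.cond picks the training iterator.
        sess.run(train_op, feed_dict={is_validating: False})
        if step % val_interval == 0:
            # Every few steps, flip the flag so the validation iterator is used.
            sess.run(val_loss, feed_dict={is_validating: True})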
The question I have is:
Let's say I have 8 GPUs, so we run training on 7 of them. Does the iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPUs with the same data?