As per the TensorFlow documentation, the prefetch and map methods of the tf.contrib.data.Dataset class both have a parameter called buffer_size.
For the prefetch method, the parameter is known as buffer_size and, according to the documentation:
buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum number elements that will be buffered when prefetching.
For the map method, the parameter is known as output_buffer_size and, according to the documentation:
output_buffer_size: (Optional.) A tf.int64 scalar tf.Tensor, representing the maximum number of processed elements that will be buffered.
Similarly, for the shuffle method the same quantity appears and, according to the documentation:
buffer_size: A tf.int64 scalar tf.Tensor, representing the number of elements from this dataset from which the new dataset will sample.
What is the relation between these parameters?
Suppose I create a Dataset object as follows:
tr_data = TFRecordDataset(trainfilenames)
tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls=5)
tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)
tr_data = tr_data.batch(trainbatchsize)
What role do the buffer parameters play in the above snippet?
Importance of buffer_size in shuffle()
I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().
Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.
A practical example: cat classifier
Suppose for instance that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category): all the cat images come first, followed by all the non cat images.
A standard way to input data with tf.data can be to have a list of filenames and a list of corresponding labels, use tf.data.Dataset.from_tensor_slices() to create the dataset, and then shuffle it with a buffer_size of 1000.
The big issue with this pipeline is that the dataset will actually not be shuffled in the right way. For about the first half of an epoch, we will only see cat images, and for the second half only non cat images. This will hurt training a lot.
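To see the effect without running TensorFlow, the buffer behaviour can be modelled in plain Python. The sketch below is not TensorFlow's implementation: `shuffle_with_buffer` is a hypothetical helper that mimics the documented behaviour (fill a buffer, emit a random slot, refill that slot from the input), applied to 10000 cat labels followed by 10000 non cat labels with a buffer of 1000 elements.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=0):
    """Hypothetical pure-Python model of a buffered shuffle (not TensorFlow code).

    Assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    output = []
    while buffer:
        # Emit a random element of the buffer...
        index = rng.randrange(len(buffer))
        output.append(buffer[index])
        try:
            # ...and replace it with the next input element, if there is one.
            buffer[index] = next(source)
        except StopIteration:
            # Input exhausted: drop the emitted slot and keep draining.
            buffer[index] = buffer[-1]
            buffer.pop()
    return output


labels = [1] * 10000 + [0] * 10000  # 1 = cat, 0 = not cat, sorted by category
shuffled = shuffle_with_buffer(labels, buffer_size=1000)

# The first non cat label only enters the buffer after 9000 elements have
# been emitted, so the first 9000 outputs are guaranteed to be cats.
print(sum(shuffled[:9000]))
# Fraction of cats in each half of the "epoch".
print(sum(shuffled[:10000]) / 10000, sum(shuffled[10000:]) / 10000)
```

In this model the first half of the output is almost entirely cats and the second half almost entirely non cats, which matches the behaviour described above.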
At the beginning of training, the dataset will take the first 1000 filenames and put them in its buffer, then pick one at random among them. Since the first 1000 images are all cat images, we will only pick cat images at the beginning.
The fix here is to make sure that buffer_size is at least 20000, or to shuffle filenames and labels in advance (with the same indices, obviously).
Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).
The takeaway is to always double check what the shuffling will do. A good way to catch these errors might be to plot the distribution of batches over time (make sure that batches contain about the same distribution as the training set, half cat and half non cat in our example).
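As a rough illustration of the second fix and of the sanity check, the sketch below shuffles filenames and labels in advance with one shared permutation and then computes the fraction of cats in each batch. The filenames, the batch size of 500 and the helper `cat_fraction_per_batch` are made up for this example; only the idea (same indices for both lists, then inspect the batches) comes from the text above.

```python
import random

# Hypothetical data: 10000 cat images followed by 10000 non cat images.
filenames = ["cat_%05d.jpg" % i for i in range(10000)]
filenames += ["not_cat_%05d.jpg" % i for i in range(10000)]
labels = [1] * 10000 + [0] * 10000

# Shuffle both lists with the same permutation so the pairs stay aligned.
rng = random.Random(0)
order = list(range(len(filenames)))
rng.shuffle(order)
filenames = [filenames[i] for i in order]
labels = [labels[i] for i in order]


def cat_fraction_per_batch(labels, batch_size):
    """Return the fraction of cat labels in each consecutive batch."""
    fractions = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        fractions.append(sum(batch) / len(batch))
    return fractions


fractions = cat_fraction_per_batch(labels, batch_size=500)
# Every batch should be close to the 50/50 split of the training set.
print(min(fractions), max(fractions))
```

The per-batch fractions are the values one would plot over time; a long run of values near 1.0 followed by a run near 0.0 would reveal the problem described in this answer.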
TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.
The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer. (Note that we removed the output_buffer_size argument from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)
Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
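The idea of a bounded buffer filled by a background thread can be sketched in plain Python. This is only a model of the concept, not how TensorFlow implements prefetching: `prefetch` and `slow_square` are hypothetical names, and the sketch assumes buffer_size is at least 1. It shows that the buffer changes when elements are computed, not the order in which they are produced.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size):
    """Hypothetical model of prefetching with a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    # The producer runs in the background while the consumer works.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks while the buffer is empty
        if item is done:
            return
        yield item


def slow_square(x):
    """Stand-in for an expensive preprocessing step."""
    time.sleep(0.001)
    return x * x


result = list(prefetch(map(slow_square, range(20)), buffer_size=4))
# Elements come out in exactly the order they went in.
print(result == [x * x for x in range(20)])
```

A larger buffer_size lets the producer run further ahead of the consumer, which helps when the time per element varies, but it never reorders elements.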
By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is greater than or equal to the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.
Actually, the answer by @olivier-moindrot is not correct.
You can verify this by creating filenames and labels as he/she mentions and printing the shuffled values.
You will see that each shuffle procedure generates samples randomly from the dataset, with a size equal to the buffer size.
As mentioned above, @olivier-moindrot's answer is not correct. For example:
and I got the following output:
The key idea behind the buffer is to always keep buffer_size elements in memory. Once you randomly get a sample (batch) from the buffer, you put the next batch of elements inside the buffer and sample from the new buffer again.
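The buffer mechanism discussed in this thread can be tried out in a few lines of plain Python. This is a sketch under stated assumptions rather than TensorFlow's implementation: `buffered_shuffle` is a hypothetical name, and it models only the rule "keep buffer_size elements in memory, emit one at random, replace it with the next input element".

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Hypothetical model of a shuffle buffer (assumes buffer_size >= 1)."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random slot and refill it with the new item.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(1000))

# buffer_size = 1: the only candidate is always the oldest element,
# so the order is unchanged.
unshuffled = list(buffered_shuffle(data, 1, random.Random(0)))
print(unshuffled == data)

# buffer_size >= len(data): every element is in the buffer before
# anything is emitted, so the result is a shuffle of the whole dataset.
full = list(buffered_shuffle(data, len(data), random.Random(0)))
print(sorted(full) == data)

# buffer_size = 10: the element emitted at position j can only come from
# the first j + 10 inputs, so elements move forward by at most 9 positions.
partial = list(buffered_shuffle(data, 10, random.Random(0)))
print(all(value <= position + 9 for position, value in enumerate(partial)))
```

The last check is the property that matters for sorted data: with a small buffer, an element from late in the input cannot show up early in the output.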