I am reading the code in the TensorFlow benchmarks repo. The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:
```python
ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=10))
```
I am trying to change this code to create the dataset directly from JPEG image files:

```python
ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))
```
I don't know what to write in the `?` place. The `map_func` in `parallel_interleave()` is the `__init__()` of the `tf.data.TFRecordDataset` class for TFRecord files, but I don't know what to write for JPEG files.
We don't need to do any transformations here, because we will zip two datasets and then do the transformations later. The code is as follows:
```python
counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(
    batching.map_and_batch(
        map_func=preprocess_fn,
        batch_size=batch_size,
        num_parallel_batches=num_splits))
```
Because we don't need a transformation in the `?` place, I tried to use an empty `map_func`, but I get the error "`map_func` must return a `Dataset` object". I also tried to use `tf.data.Dataset`, but the output says `Dataset` is an abstract class that is not allowed there.
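The only `map_func` I could get past that error is one that wraps each element back into a one-element dataset, which is just a no-op, so I assume it's not the intended answer (written here with plain `Dataset.interleave` so it runs without the contrib import):

```python
import tensorflow as tf

jpeg_file_names = ["a.jpg", "b.jpg", "c.jpg"]  # placeholder file list

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
# map_func must return a Dataset, so wrap each filename back
# into a one-element dataset -- effectively a no-op interleave.
ds = ds.interleave(
    lambda filename: tf.data.Dataset.from_tensors(filename),
    cycle_length=10)
```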
Can anyone help with this? Thanks very much.
`parallel_interleave` is useful when you have a transformation that transforms each element of a source dataset into multiple elements in the destination dataset. I'm not sure why they use it in the benchmarks repo like that, when they could have just used a `map` with parallel calls.

Here's how I suggest using `parallel_interleave` for reading images from several directories, each containing one class.

There are three steps. First, we get the list of directories and their labels (#1).

Then, we map these to a dataset of files. But if we do a simple `.flat_map()`, we will end up with all the files of label 0, followed by all the files of label 1, then 2, etc. Then we'd need really large shuffle buffers to get a meaningful shuffle.

So, instead, we apply `parallel_interleave` (#2). Here is `get_files()`:
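A minimal sketch of such a `get_files()` (the `*.jpg` pattern and the exact signature are assumptions; it turns one (directory, label) pair into a dataset of (filename, label) pairs):

```python
import tensorflow as tf

def get_files(dir_path, label):
    # List this directory's images; shuffle=False keeps the listing deterministic.
    files = tf.data.Dataset.list_files(
        tf.strings.join([dir_path, "/*.jpg"]), shuffle=False)
    # Pair every filename with the directory's label.
    labels = tf.data.Dataset.from_tensors(label).repeat()
    return tf.data.Dataset.zip((files, labels))
```

With a dataset of (directory, label) pairs, applying `parallel_interleave(get_files, cycle_length=num_classes, block_length=4)` then yields blocks of filenames interleaved across classes.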
Using `parallel_interleave` ensures the `list_files` of each directory is run in parallel, so by the time the first `block_length` files are listed from the first directory, the first `block_length` files from the 2nd directory will also be available (also from the 3rd, 4th, etc.). Moreover, the resulting dataset will contain interleaved blocks of each label, e.g. `1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ...` (for 3 classes and `block_length=4`).

Finally, we read the images from the list of files (#3). Here is `read_and_decode()`:
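A sketch of such a `read_and_decode()` (the target size, the number of classes, and the simple scale-to-[0, 1] pre-processing are assumptions):

```python
import tensorflow as tf

IMAGE_SIZE = 224  # target side length -- an assumption
NUM_CLASSES = 3   # number of classes -- an assumption

def read_and_decode(filename, label):
    # Read and decode the JPEG into an HxWx3 uint8 tensor.
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image, channels=3)
    # Resize, then scale pixel values to [0, 1] as basic pre-processing.
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE]) / 255.0
    # One-hot encode the label.
    return image, tf.one_hot(label, NUM_CLASSES)
```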
This function takes an image path and its label and returns a tensor for each: an image tensor for the path, and a one-hot encoding for the label. This is also the place where you can do all the transformations on the image. Here, I do resizing and basic pre-processing.
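Putting the three steps together, a self-contained end-to-end sketch could look like this (the toy directories, the sizes, and plain `Dataset.interleave` standing in for the contrib `parallel_interleave` are all assumptions):

```python
import os
import tempfile

import tensorflow as tf

# Toy layout: one directory per class, four tiny JPEGs each (an assumption).
root = tempfile.mkdtemp()
class_names = ["cat", "dog"]
for label, name in enumerate(class_names):
    os.makedirs(os.path.join(root, name))
    for i in range(4):
        jpeg = tf.io.encode_jpeg(tf.zeros([8, 8, 3], dtype=tf.uint8))
        tf.io.write_file(os.path.join(root, name, "%d.jpg" % i), jpeg)

# #1: the list of directories and their labels.
dirs = [os.path.join(root, name) for name in class_names]
ds = tf.data.Dataset.from_tensor_slices((dirs, list(range(len(class_names)))))

# #2: interleave each directory's file listing into (filename, label) pairs.
def get_files(dir_path, label):
    files = tf.data.Dataset.list_files(
        tf.strings.join([dir_path, "/*.jpg"]), shuffle=False)
    return tf.data.Dataset.zip(
        (files, tf.data.Dataset.from_tensors(label).repeat()))

ds = ds.interleave(get_files, cycle_length=len(dirs), block_length=2)

# #3: read each image and one-hot encode its label.
def read_and_decode(filename, label):
    image = tf.image.decode_jpeg(tf.io.read_file(filename), channels=3)
    image = tf.image.resize(image, [32, 32]) / 255.0
    return image, tf.one_hot(label, len(class_names))

ds = ds.map(read_and_decode)
```

With `block_length=2` and two classes, the labels come out interleaved as `0 0 1 1 0 0 1 1`, so even a modest shuffle buffer mixes the classes well.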