TensorFlow - Read video frames from TFRecords file

Posted 2020-02-09 16:02

TL;DR: my question is how to load compressed video frames from TFRecords.

I am setting up a data pipeline for training deep learning models on a large video dataset (Kinetics). For this I am using TensorFlow, more specifically the tf.data.Dataset and TFRecordDataset structures. As the dataset contains ~300k videos of 10 seconds each, there is a large amount of data to deal with. During training, I want to randomly sample 64 consecutive frames from a video, so fast random access is important. There are several possible data loading scenarios:

  1. Sample from Video. Load the videos using ffmpeg or OpenCV and sample frames. Not ideal as seeking in videos is tricky, and decoding video streams is much slower than decoding JPG.
  2. JPG Images. Preprocess the dataset by extracting all video frames as JPG. This generates a huge number of files, which is unlikely to be fast because of the random file access involved.
  3. Data Containers. Preprocess the dataset to TFRecords or HDF5 files. This requires more work to get the pipeline ready, but is most likely the fastest of these options.

I have decided to go for option (3) and use TFRecord files to store a preprocessed version of the dataset. However, this is also not as straightforward as it seems, for example:

  1. Compression. Storing the video frames as uncompressed byte data in TFRecords will require a huge amount of disk space. Therefore, I extract all the video frames, apply JPG compression and store the compressed bytes as TFRecords.
  2. Video Data. We are dealing with video, so each example in the TFRecords file will be quite large and contain several hundred video frames (typically 250-300 for 10 seconds of video, depending on the frame rate).
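
As a rough back-of-the-envelope check on both points, the sketch below estimates the frame count per clip and the uncompressed size of the whole dataset. The 224x224 frame size is purely an assumption for illustration (the resize resolution is not stated here), and the helper names are hypothetical. The video count and clip length come from the description above, and 25 to 30 fps is what 250-300 frames per 10 seconds implies.

```python
NUM_VIDEOS = 300_000      # ~300k videos, as stated above
CLIP_SECONDS = 10         # clip length, as stated above
HEIGHT, WIDTH = 224, 224  # ASSUMPTION: resized frame size, not given in the post
CHANNELS = 3              # one uint8 byte per colour channel


def frames_per_clip(fps, seconds=CLIP_SECONDS):
    """Number of frames in one clip at the given frame rate."""
    return fps * seconds


def raw_dataset_bytes(fps):
    """Total size in bytes if every frame were stored uncompressed."""
    bytes_per_frame = HEIGHT * WIDTH * CHANNELS
    return NUM_VIDEOS * frames_per_clip(fps) * bytes_per_frame


for fps in (25, 30):
    terabytes = raw_dataset_bytes(fps) / 1e12
    print("{} fps: {} frames per clip, ~{:.1f} TB uncompressed".format(
        fps, frames_per_clip(fps), terabytes))
```

Under these assumptions the raw data would be on the order of 10 TB, which is why storing JPG-compressed bytes is attractive.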

I have written the following code to preprocess the video dataset and write the video frames as TFRecord files (each ~5GB in size):

import cv2
import tensorflow as tf  # TensorFlow 1.x API (tf.python_io)

def _int64_feature(value):
    """Wrapper for inserting int64 features into Example proto."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as bytes in:
  # 'frames/0000', 'frames/0001', ...
  for i in range(len(frames)):
      ret, buffer = cv2.imencode(".jpg", frames[i])
      features["frames/{:04d}".format(i)] = _bytes_feature(tf.compat.as_bytes(buffer.tobytes()))

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

This works fine; the dataset is nicely written as TFRecord files with the frames as compressed JPG bytes. My question is how to read the TFRecord files during training, randomly sample 64 frames from a video, and decode the JPG images.

According to TensorFlow's documentation on tf.data, we need to do something like:

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

There are many examples of how to do this with images, and that is quite straightforward. However, for video and random sampling of frames I am stuck. The tf.train.Features object stores the frames as frames/0000, frames/0001, etc. My first question is how to randomly sample a set of consecutive frames from this inside the dataset.map() function. Considerations are that each frame has a variable number of bytes due to JPG compression and needs to be decoded using tf.image.decode_jpeg.
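
To make the requirement concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of which per-frame feature keys a 64-frame window would need under the layout written above. The helper name `frame_keys` and the start index 120 are hypothetical, chosen only for illustration; the key format mirrors the `"frames/{:04d}"` format used when writing.

```python
SEQ_NUM_FRAMES = 64  # window length from the question


def frame_keys(start, seq_len=SEQ_NUM_FRAMES):
    """Feature keys of seq_len consecutive frames beginning at index start."""
    return ["frames/{:04d}".format(i) for i in range(start, start + seq_len)]


keys = frame_keys(start=120)  # 120 is an arbitrary example start index
print(keys[0], keys[-1])      # frames/0120 frames/0183
```

Note that the set of keys depends on a start index that would be chosen at run time, differently for every example.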

Any help on how best to set up reading video samples from TFRecord files would be appreciated!

2 Answers
Animai°情兽
#2 · 2020-02-09 16:35

Since you're using very similar dependencies, I suggest taking a look at the following Python package, as it addresses your exact problem setting:

pip install video2tfrecord

or refer to https://github.com/ferreirafabio/video2tfrecord. It should also be adaptable enough to use with tf.data.Dataset.

Disclaimer: I am one of the authors of the package.

Melony?
#3 · 2020-02-09 16:37

Encoding each frame as a separate feature makes it difficult to select frames dynamically, because the signature of tf.parse_example() (and tf.parse_single_example()) requires that the set of parsed feature names be fixed at graph construction time. However, you could try encoding the frames as a single feature that contains a list of JPEG-encoded strings:

def _bytes_list_feature(values):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as a list of strings in 'frames'
  encoded_frames = [tf.compat.as_bytes(cv2.imencode(".jpg", frame)[1].tobytes())
                    for frame in frames]
  features['frames'] = _bytes_list_feature(encoded_frames)

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

Once you have done this, it will be possible to slice the frames feature dynamically, using a modified version of your parsing code:

def decode(serialized_example, sess):
  # Prepare feature list; read encoded JPG images as bytes
  features = dict()
  features["class_label"] = tf.FixedLenFeature((), tf.int64)
  features["frames"] = tf.VarLenFeature(tf.string)
  features["num_frames"] = tf.FixedLenFeature((), tf.int64)

  # Parse into tensors
  parsed_features = tf.parse_single_example(serialized_example, features)

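  # Randomly sample an offset from the valid range. SEQ_NUM_FRAMES is the
  # window length (64 in the question). maxval is exclusive, hence the + 1.
  random_offset = tf.random_uniform(
      shape=(), minval=0,
      maxval=parsed_features["num_frames"] - SEQ_NUM_FRAMES + 1, dtype=tf.int64)

  offsets = tf.range(random_offset, random_offset + SEQ_NUM_FRAMES)

  # Decode the encoded JPG images. dtype is required because the output
  # (uint8 images) differs in type from the input (int64 offsets).
  images = tf.map_fn(lambda i: tf.image.decode_jpeg(parsed_features["frames"].values[i]),
                     offsets, dtype=tf.uint8)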

  label  = tf.cast(parsed_features["class_label"], tf.int64)

  return images, label

(Note that I haven't been able to run your code, so there may be some small errors, but hopefully it is enough to get you started.)
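
One detail worth checking in sampling code of this kind is the bound on the random offset. tf.random_uniform excludes maxval for integer dtypes, so the largest valid start index, num_frames - SEQ_NUM_FRAMES, is only reachable if the bound passed in is one larger than that. The plain-Python sketch below (no TensorFlow; `sample_window` is a hypothetical helper and the 70-frame video is an assumed example) mirrors the same logic with random.randrange, which also excludes its upper bound.

```python
import random

SEQ_NUM_FRAMES = 64  # window length from the question


def sample_window(num_frames, seq_len=SEQ_NUM_FRAMES, rng=random):
    """Pick a random run of seq_len consecutive frame indices."""
    if num_frames < seq_len:
        raise ValueError("video has fewer frames than the window length")
    # randrange excludes its upper bound, like an integer maxval,
    # hence the + 1 so that the final window is also reachable.
    offset = rng.randrange(0, num_frames - seq_len + 1)
    return list(range(offset, offset + seq_len))


# Draw many windows from an assumed 70-frame video and collect the start
# indices that occur; valid starts are 0 through 6 inclusive.
starts = {sample_window(70, rng=random.Random(seed))[0] for seed in range(200)}
print(sorted(starts))
```

With the + 1 in place, a video that has exactly SEQ_NUM_FRAMES frames also works: the only possible offset is 0.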
