Tensorflow 1.10 TFRecordDataset - recovering TFRecords

Posted 2019-04-06 14:20

Question:

Notes:

  1. This question extends a previous question of mine. There I asked about the best way to store some dummy data as an Example versus a SequenceExample, seeking to know which is better suited to data like the dummy data provided. I give explicit constructions of both the Example and the SequenceExample and, in the answers, a programmatic way to build them.

  2. Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) where you can try the code out yourself. All the necessary code is there and it is generously commented.

I am trying to learn how to convert my data into TFRecords, as the claimed benefits are worthwhile for my data. However, the documentation leaves a lot to be desired, and the tutorials and blog posts I have seen that try to go deeper really only touch the surface or rehash the sparse docs that exist.

For the demo data considered in my previous question - as well as here - I have written a decent class that takes:

  • a sequence with n channels (in this example integer-based and of fixed length)
  • soft-labeled class probabilities (in this example n float-based classes)
  • some meta data (in this example a string and two floats)

and can encode the data in 1 of 6 forms:

  1. Example, with sequence channels / classes separate in a numeric type (int64 in this case) with meta data tacked on
  2. Example, with sequence channels / classes separate as a byte string (via numpy.ndarray.tostring()) with meta data tacked on
  3. Example, with sequence / classes dumped as byte string with meta data tacked on
  4. SequenceExample, with sequence channels / classes separate in a numeric type and meta data as context
  5. SequenceExample, with sequence channels separate as a byte string and meta data as context
  6. SequenceExample, with sequence and classes dumped as byte string and meta data as context

This works fine.
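
For concreteness, here is a minimal sketch (not the class from the Colab) of how forms 1 and 2 differ for a single 3-channel sequence; the channel_{i} key names and the shapes are my own illustration:

import numpy as np
import tensorflow as tf

# Dummy 10-step sequence with 3 integer channels (shapes are illustrative).
sequence = np.random.randint(0, 100, size=(10, 3)).astype(np.int64)

# Form 1: each channel stored as a numeric int64 feature.
numeric_example = tf.train.Example(features=tf.train.Features(feature={
    f'channel_{i}': tf.train.Feature(
        int64_list=tf.train.Int64List(value=sequence[:, i]))
    for i in range(sequence.shape[1])
}))

# Form 2: each channel serialized to a single byte string via tostring().
bytes_example = tf.train.Example(features=tf.train.Features(feature={
    f'channel_{i}': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[sequence[:, i].tostring()]))
    for i in range(sequence.shape[1])
}))

The trade-off: the numeric form keeps the values inspectable in the record itself, while the byte-string form is more compact but forgets the shape and dtype, which matters when parsing (see the answer below).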

In the Colab I show how to write dummy data all in the same file as well as in separate files.
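
The two writing strategies only differ in how the writers are opened. A hedged sketch using TF 1.x's tf.python_io.TFRecordWriter, where examples stands in for any list of tf.train.Example / tf.train.SequenceExample protos:

import tensorflow as tf

examples = []  # stand-in for a list of tf.train.(Sequence)Example protos

# All records in a single file...
with tf.python_io.TFRecordWriter('dummy_sequences.tfrecords') as writer:
    for ex in examples:
        writer.write(ex.SerializeToString())

# ...or one file per record, matching the dummy_sequences_{i} names used below.
for i, ex in enumerate(examples):
    with tf.python_io.TFRecordWriter(f'dummy_sequences_{i}.tfrecords') as writer:
        writer.write(ex.SerializeToString())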

My question is how can I recover this data?

I have made 4 attempts at doing so in the linked file.

Why is TFReader under a different sub-package from TFWriter?

Answer 1:

Solved by updating the features to include shape information and remembering that a SequenceExample contains unnamed FeatureLists.

import os
import tensorflow as tf

# Context (per-example meta data) features: scalars, hence shape [].
context_features = {
    'Name' : tf.FixedLenFeature([], dtype=tf.string),
    'Val_1': tf.FixedLenFeature([], dtype=tf.float32),
    'Val_2': tf.FixedLenFeature([], dtype=tf.float32)
}

# Sequence features: one 3-channel vector per time step, hence shape (3,).
sequence_features = {
    'sequence': tf.FixedLenSequenceFeature((3,), dtype=tf.int64),
    'pclasses': tf.FixedLenSequenceFeature((3,), dtype=tf.float32),
}

def parse(record):
  # Returns a (context, sequence) tuple of dicts of tensors.
  parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features
  )
  return parsed


filenames = [os.path.join(os.getcwd(),f"dummy_sequences_{i}.tfrecords") for i in range(3)]
dataset = tf.data.TFRecordDataset(filenames).map(parse)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(dataset)

sess = tf.Session()
for _ in range(2):
  # Initialize an iterator over the training dataset.
  sess.run(training_init_op)
  for _ in range(3):
    ne = sess.run(next_element)
    print(ne)
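
For the byte-string variants (forms 2 and 3), the record carries no shape or dtype information, so the parse has to decode and reshape the raw bytes by hand. A sketch, assuming the writer used the same 'sequence' / 'pclasses' keys and int64 / float32 dtypes as above:

bytes_features = {
    'sequence': tf.FixedLenFeature([], dtype=tf.string),
    'pclasses': tf.FixedLenFeature([], dtype=tf.string),
}

def parse_bytes(record):
  # decode_raw only recovers a flat buffer; the (steps, channels) shape
  # must be reapplied manually since tostring() discarded it.
  parsed = tf.parse_single_example(record, features=bytes_features)
  sequence = tf.reshape(tf.decode_raw(parsed['sequence'], tf.int64), (-1, 3))
  pclasses = tf.reshape(tf.decode_raw(parsed['pclasses'], tf.float32), (-1, 3))
  return sequence, pclasses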