Context
It is known that, at the moment, TF's Record documentation leaves something to be desired.
My question concerns what is optimal for storing:
- a sequence,
- its per-element class probabilities, and
- some (context?) information (e.g. name of the sequence)
as a TF Record.
Namely, this question considers storing the sequence and class probabilities as channels vs. as a byte string, and whether the meta information should go in as features of a tf.train.Example or as the context of a tf.train.SequenceExample (see questions at the bottom).
M.W.E.
For example, let's assume my sequence looks like this:
seq = [
    # el1, el2
    [ 0, 1 ], # channel 1
    [ 0, 1 ]  # channel 2
]
i.e. it is a 2-channel sequence of fixed length (in this example, 2) whose values can only be integers,
and that we have three classes into which we are trying to segment the sequence:
cls_probs = [
    # cls1, cls2, cls3
    [ 0, 0.9, 0.1 ], # class probabilities element 1
    [ 0, 0.1, 0.9 ]  # class probabilities element 2
]
where, in effect, both seq and cls_probs are numpy.arrays.
The network only requires this information. However, I also have some meta data which I would like to keep with the sequence.
e.g.
meta = {
    'name': 'my_seq',   # safer to keep this with the data rather than as file name
    'meta_val_1': 100,  # not used by network, but may be useful when evaluating network's predictions for this particular sequence
    'meta_val_2': 10
}
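To make the snippets below concretely runnable, here is a minimal setup with seq and cls_probs as NumPy arrays. The dtypes are my assumption (int64 for the sequence, float32 for the probabilities), chosen to line up with Int64List and FloatList:

```python
import numpy as np

# 2-channel integer sequence of length 2; rows are channels, columns are elements
seq = np.array([
    # el1, el2
    [0, 1],  # channel 1
    [0, 1],  # channel 2
], dtype=np.int64)

# per-element probabilities over 3 classes; rows are elements, columns are classes
cls_probs = np.array([
    # cls1, cls2, cls3
    [0.0, 0.9, 0.1],  # class probabilities element 1
    [0.0, 0.1, 0.9],  # class probabilities element 2
], dtype=np.float32)

meta = {'name': 'my_seq', 'meta_val_1': 100, 'meta_val_2': 10}
```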
Making TF Record
tf.train.Example
Then I have several ways I could construct my tf.train.Example
:
as channels
example = tf.train.Example(
    features = tf.train.Features(
        feature = {
            # rows of seq are channels in this layout
            'channel_1': tf.train.Feature(int64_list=tf.train.Int64List(value=seq[0, :])),
            'channel_2': tf.train.Feature(int64_list=tf.train.Int64List(value=seq[1, :])),
            'class_1'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 0])),
            'class_2'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 1])),
            'class_3'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 2])),
            'name'     : tf.train.Feature(bytes_list=tf.train.BytesList(value=[f'{meta["name"]}'.encode('utf-8')])),
            # should these be FloatList even though each is just a single value?
            # should these be included here if they are not used by the network?
            'val_1'    : tf.train.Feature(float_list=tf.train.FloatList(value=[meta['meta_val_1']])),
            'val_2'    : tf.train.Feature(float_list=tf.train.FloatList(value=[meta['meta_val_2']])),
        }
    )
)
where f'{variable}'.encode('utf-8') stands in for the unsupported fb'<string>' prefix, since f-strings cannot be combined with bytes literals (note: f-strings require Python 3.6+).
This format is somewhat nice, as each sequence channel is explicit. However, it is also verbose and requires preprocessing when loaded before it can be fed into the network.
as string
or, I could dump my arrays to byte strings:
example = tf.train.Example(
    features = tf.train.Features(
        feature = {
            # note: BytesList expects a *list* of bytes
            'sequence' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tostring()])),
            'cls_probs': tf.train.Feature(bytes_list=tf.train.BytesList(value=[cls_probs.tostring()])),
            # ... see encoding of meta values from above
        }
    )
)
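The price of the byte-string layout is that reading it back requires knowing the original dtype and shape, since neither is stored in the record. A sketch of the recovery (the int64 dtype here is an assumption that must match the array at encoding time):

```python
import numpy as np
import tensorflow as tf

seq = np.array([[0, 1], [0, 1]], dtype=np.int64)

example = tf.train.Example(features=tf.train.Features(feature={
    'sequence': tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tobytes()])),
}))

parsed = tf.io.parse_single_example(example.SerializeToString(), {
    'sequence': tf.io.FixedLenFeature([], tf.string),
})
# decode_raw yields a flat tensor; the shape must be restored manually
recovered = tf.reshape(tf.io.decode_raw(parsed['sequence'], tf.int64), seq.shape)
```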
tf.train.SequenceExample
TF Records also accept another form: tf.train.SequenceExample. A SequenceExample expects context features and an ordered list of unnamed features.
as channels
So, restructuring the above as channels example:
example = tf.train.SequenceExample(
    context = tf.train.Features(
        feature = {
            'Name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[f'{meta["name"]}'.encode('utf-8')])),
            'Val_1': tf.train.Feature(float_list=tf.train.FloatList(value=[meta['meta_val_1']])),
            'Val_2': tf.train.Feature(float_list=tf.train.FloatList(value=[meta['meta_val_2']])),
        }
    ),
    feature_lists = tf.train.FeatureLists(
        feature_list = {
            'sequence': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(int64_list=tf.train.Int64List(value=seq[0, :])),
                    tf.train.Feature(int64_list=tf.train.Int64List(value=seq[1, :])),
                ]
            ),
            'class_probabilities': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 0])),
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 1])),
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:, 2]))
                ]
            )
        }
    )
)
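For reference, a SequenceExample is parsed with tf.io.parse_single_sequence_example, which returns a (context, sequence) pair of dicts. A sketch using a reduced version of the names above:

```python
import numpy as np
import tensorflow as tf

seq = np.array([[0, 1], [0, 1]], dtype=np.int64)

example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'Name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'my_seq'])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'sequence': tf.train.FeatureList(feature=[
            # one Feature per channel, as in the snippet above
            tf.train.Feature(int64_list=tf.train.Int64List(value=row)) for row in seq
        ]),
    }),
)

context, sequence = tf.io.parse_single_sequence_example(
    example.SerializeToString(),
    context_features={'Name': tf.io.FixedLenFeature([], tf.string)},
    sequence_features={'sequence': tf.io.FixedLenSequenceFeature([2], tf.int64)},
)
```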
as string
Likewise, we can create the as string example:
example = tf.train.SequenceExample(
    context = tf.train.Features(
        # see above
    ),
    feature_lists = tf.train.FeatureLists(
        feature_list = {
            'sequence': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tostring()]))
                ]
            ),
            'class_probabilities': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=[cls_probs.tostring()]))
                ]
            )
        }
    )
)
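In all four variants, serialization to disk is the same: write example.SerializeToString() with a TFRecordWriter and read it back with tf.data.TFRecordDataset. A minimal sketch (the file location is illustrative, and the single 'val' feature is a stand-in for any of the layouts above):

```python
import os
import tempfile
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    'val': tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))

# illustrative file location
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# each element of the dataset is one serialized Example
dataset = tf.data.TFRecordDataset(path)
records = [tf.io.parse_single_example(rec, {'val': tf.io.FixedLenFeature([], tf.int64)})
           for rec in dataset]
```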
Questions
Here I gave a M.W.E. of how one could construct an example (ready to be exported to a TF Record) as both a tf.train.Example and a tf.train.SequenceExample. Further, I demonstrated how to do this either per channel or by dumping the arrays as byte strings. Both of these methods (as channels / as strings) include the meta information within the example.
Thus my questions are:
which way of storage (as channels / as string) is more optimal (for reading, writing, re-use, etc.)?
given the meta information that should be kept with the example, is it better to use tf.train.Example and store the meta information as features there, or to use tf.train.SequenceExample and store the meta information in the context argument?
Does anyone know if there are any notable advantages / disadvantages for any of these four strategies?
For those who would like to test this on larger, less dummy-like data, some functions for producing such data can be found below.
Lastly, I would like to point out this Medium post, which greatly elaborates on TF's docs.
Note: this is not an answer to the question (i.e. whether Example or SequenceExample is better, and whether a sequence should be broken down into channels or stored as a byte string).
Rather, it occurred to me while looking through TensorFlow Records tutorials, posts, videos, etc. that most examples I encountered focused on constructing a (Sequence)Example with concrete data and did not show how one could be made more dynamically. Thus I encapsulated the four methods above for converting data of the described type into an example.
While still tied to the data we are trying to create a (Sequence)Example around, hopefully this, in addition to the concrete examples above, will be of use to those who are still somewhat confused about the format.
Here is some code to play around with. Feedback is welcome.
Update
This has been condensed into a package named Feature Input / Output (FIO).
Here is a Colab demonstrating how to use it.
Namely, it introduces the concept of a "schema", which allows you to define your data once rather than twice (once to encode into an example, and once to extract from a record).
Original
Setup
Some light helper functions
SequenceRecords
Dummy data
Initiate and go
This is an extension to my first answer which some may find useful.
Rather than considering the encoding, here I consider the opposite, i.e. how one retrieves the data from a TFRecord.
The colab can be found here.
In essence, I survey 10 ways of encoding an array / array of arrays (there are more ways to do this).
In short, with the exception of method 8, I was able to 'recover' the data (write it to a TFRecord and read it back).
However, it should be noted that for methods 7 and 10, the retrieved array comes back flattened.