Tensorflow v1.10: store images as byte strings or

Posted 2019-02-19 22:48

Context

It is known that, at the moment, TF's Record documentation leaves something to be desired.

My question is in regards to what is optimal for storing:

  • a sequence,
  • its per-element class probabilities, and
  • some (context?) information (e.g. name of the sequence)

as a TF Record.

Namely, this question considers storing the sequence and class probabilities as channels versus as a byte string, and whether the meta information should go in as features of a tf.train.Example or as the context of a tf.train.SequenceExample (see the questions at the bottom).

M.W.E.

For example, let's assume my sequence looks like this

seq = [ 
        # el1, el2 
        [ 0,   1   ], # channel 1
        [ 0,   1   ]  # channel 2
      ]

i.e. it is a 2-channel sequence of fixed length (in this example, 2) whose values can only be integers,

and that we have three classes into which we are trying to segment the sequence

cls_probs = [ 
        #cls1, cls2, cls3
        [0   , 0.9 , 0.1 ], # class probabilities element 1
        [0   , 0.1 , 0.9 ]  # class probabilities element 2
      ]

where in effect both seq and cls_probs are numpy.arrays.
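For concreteness, the two arrays above can be built as in the sketch below. The dtypes are my assumption (the text only says "integer" and "probabilities"). Note that the two layouts differ: seq has one row per channel, while cls_probs has one row per element, so a channel is a row of seq but a class is a column of cls_probs.

```python
import numpy as np

# Layout as written above: one row per channel, one column per element.
seq = np.array([[0, 1],
                [0, 1]], dtype=np.int64)  # dtype is an assumption

# One row per element, one column per class.
cls_probs = np.array([[0.0, 0.9, 0.1],
                      [0.0, 0.1, 0.9]], dtype=np.float32)  # dtype is an assumption

channel_1 = seq[0]         # all elements of channel 1 (a row)
class_1 = cls_probs[:, 0]  # probability of class 1 at every element (a column)

print(seq.shape, cls_probs.shape)
```

Keeping track of which axis holds the channels matters later, because the per-channel encodings slice along that axis.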

The network only requires this information. However, I also have some meta data which I would like to keep with the sequence.

e.g.

meta = {
           'name': 'my_seq',  # safer to keep this with the data rather than as file name
           'meta_val_1': 100, # not used by network, but may be useful when evaluating network's predictions for this particular sequence
           'meta_val_2': 10
       }

Making TF Record

tf.train.Example

Then I have several ways I could construct my tf.train.Example:

as channels

example = tf.train.Example(
    features = tf.train.Features(
        feature = {
            # seq is laid out as (channels, elements), so each channel is a row
            'channel_1': tf.train.Feature(int64_list=tf.train.Int64List(value=seq[0])),
            'channel_2': tf.train.Feature(int64_list=tf.train.Int64List(value=seq[1])),
            'class_1'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,0])),
            'class_2'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,1])),
            'class_3'  : tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,2])),
            'name'     : tf.train.Feature(bytes_list=tf.train.BytesList(value=[f'{meta["name"]}'.encode('utf-8')])), 
            # should these be FloatList even though it is just a single value?
            # should these be included here if they are not used by the network?
            # FloatList needs numbers, not strings
            'val_1'    : tf.train.Feature(float_list=tf.train.FloatList(value=[float(meta["meta_val_1"])])),
            'val_2'    : tf.train.Feature(float_list=tf.train.FloatList(value=[float(meta["meta_val_2"])])),
    })
)

where f'{variable}'.encode('utf-8') stands in for fb'<string>', which Python does not currently support (note: f-strings only work with Python 3.6+).

This format is somewhat nice as each sequence channel is explicit. However, it is also verbose and requires preprocessing after loading before it can be fed into the network.

as string

Or, I could dump my array to a string:

example = tf.train.Example(
    features = tf.train.Features(
        feature = {
            # BytesList expects a list of byte strings, hence the brackets
            'sequence' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tostring()])),
            'cls_probs': tf.train.Feature(bytes_list=tf.train.BytesList(value=[cls_probs.tostring()])),
            # ... see encoding of meta values from above
    })
)
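One thing to keep in mind with the byte-string route: the dump carries neither shape nor dtype, so both have to be known again at read time. A minimal NumPy-only sketch of the round trip (no TFRecord involved; the int64 dtype is my assumption):

```python
import numpy as np

seq = np.array([[0, 1], [0, 1]], dtype=np.int64)

raw = seq.tobytes()  # same bytes as the older .tostring() spelling

# The byte string records no shape or dtype, so both must be supplied again.
flat = np.frombuffer(raw, dtype=np.int64)
restored = flat.reshape(seq.shape)

# Decoding with the wrong dtype does not raise; it silently yields
# a different number of (different) values.
wrong = np.frombuffer(raw, dtype=np.int32)

print(len(raw), flat.shape, wrong.shape)
```

In other words, the "as string" variant is compact to write, but the reader has to carry the shape and dtype somewhere else (in code, or as extra features).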

tf.train.SequenceExample

TF Records also accept another form: tf.train.SequenceExample. SequenceExample expects context features and an ordered list of unnamed features.

as channels

So, restructuring the "as channels" example above:

example = tf.train.SequenceExample(
    context = tf.train.Features(
        feature = {
            'Name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[f'{meta["name"]}'.encode('utf-8')])), 
            # FloatList needs numbers, not strings
            'Val_1': tf.train.Feature(float_list=tf.train.FloatList(value=[float(meta["meta_val_1"])])),
            'Val_2': tf.train.Feature(float_list=tf.train.FloatList(value=[float(meta["meta_val_2"])])),
        }
    ),
    feature_lists = tf.train.FeatureLists(
        feature_list = {
            'sequence': tf.train.FeatureList(
                feature = [
                    # seq is laid out as (channels, elements), so each channel is a row
                    tf.train.Feature(int64_list=tf.train.Int64List(value=seq[0])),
                    tf.train.Feature(int64_list=tf.train.Int64List(value=seq[1])),
                ]
            ),
            'class_probabilities': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,0])),
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,1])),
                    tf.train.Feature(float_list=tf.train.FloatList(value=cls_probs[:,2]))
                ]
            )
        }
    )
)

as string

Likewise, we can create the "as string" example:

example = tf.train.SequenceExample(
    context = tf.train.Features(
        # see above
    ),
    feature_lists = tf.train.FeatureLists(
        feature_list = {
            'sequence': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tostring()]))
                ]
            ),
            'class_probabilities': tf.train.FeatureList(
                feature = [
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=[cls_probs.tostring()]))
                ]
            )
        }
    )
)

Questions

Here I gave an M.W.E. for how one could construct an example (ready to be exported to a TF Record) as both tf.train.Example and tf.train.SequenceExample. Further, I demonstrated how to do this either per channel or by dumping to a byte string. Both of these methods (as channels / as strings) include the meta information within the example.

Thus my questions are:

  1. Which way of storing the data (as channels / as string) is better for reading, writing, re-use, etc.?

  2. Given the meta information which should be kept with the example, is it better to use tf.train.Example and store the meta information as features there, or to use tf.train.SequenceExample and store the meta information in the context argument?

Does anyone know if there are any notable advantages / disadvantages for any of these four strategies?
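One difference between the two storage styles can be checked without TensorFlow. As far as I know, tf.train.FloatList holds 32-bit floats, so a float64 array stored "as channels" is rounded to float32, whereas a byte dump keeps whatever dtype the array had (at twice the size per value for float64). A small NumPy sketch of that trade-off, where the cast to float32 is only a stand-in for what a FloatList would do:

```python
import numpy as np

probs = np.array([0.1, 0.9, 1.0 / 3.0], dtype=np.float64)

# Stand-in for storing in a FloatList: values are kept as float32.
as_float_list = probs.astype(np.float32).astype(np.float64)

# Byte dump: exact round trip, provided the dtype is known when reading.
as_bytes = np.frombuffer(probs.tobytes(), dtype=np.float64)

print(np.array_equal(as_bytes, probs))       # byte dump is exact
print(np.abs(as_float_list - probs).max())   # float32 rounding error
print(probs.astype(np.float32).nbytes, probs.nbytes)
```

For class probabilities the rounding is probably harmless, but it is worth knowing before comparing stored values for exact equality.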

For those who would like to test this on larger, less dummy-like data, some functions for producing this code can be found below.

Lastly, I would like to point out this medium post which greatly elaborates on TF's docs.

2 answers
小情绪 Triste *
#2 · 2019-02-19 23:13

Note: this is not an answer to the question (whether Example or SequenceExample is better and whether or not a sequence should be broken down into channels or as a byte string)

Rather, it occurred to me whilst looking at TensorFlow Records tutorials, posts, videos, etc. that most examples I encountered focused on constructing the (Sequence)Example with concrete data and did not show how one could be built more dynamically. Thus I encapsulated the four methods above for converting data of the type described in the question.

While the code is still tied to the data we are trying to build a (Sequence)Example around, hopefully it is of use, in addition to the concrete examples above, to those who are still somewhat confused about the format.

Here is some code to play around with. Feedback is welcome.

Update

This has been condensed into a package named Feature Input / Output (FIO).

Here is a Colab demonstrating how to use it.

Namely, it introduces the concept of a "schema":

SCHEMA = {
    'my-feature': {'length': 'fixed', 'dtype': tf.string,  'shape': []},
    'seq': {
        'length': 'fixed',
        'dtype': tf.int64,
        'shape': [4, 3],
        'encode': 'channels',
        'channel_names': ['A', 'B', 'C'],
        'data_format': 'channels_last'
    }
}

which allows you to define your data _once_ rather than twice (once to encode into example, and once to extract from a record).
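To illustrate the "define once" idea without depending on the package, here is a small NumPy-only sketch. It is not FIO's API: the schema keys, encode() and decode() are hypothetical names of my own, and the payload is a plain dict of byte strings rather than a tf.train.Example.

```python
import numpy as np

# Hypothetical schema: one entry per array, stating dtype and shape once.
SCHEMA = {
    'seq':       {'dtype': np.int64,   'shape': (4, 3)},
    'cls_probs': {'dtype': np.float32, 'shape': (4, 2)},
}

def encode(arrays: dict, schema: dict) -> dict:
    '''Dump each array to bytes after checking it against the schema.'''
    encoded = {}
    for name, spec in schema.items():
        array = np.asarray(arrays[name], dtype=spec['dtype'])
        if array.shape != tuple(spec['shape']):
            raise ValueError(f'{name}: expected shape {spec["shape"]}, got {array.shape}')
        encoded[name] = array.tobytes()
    return encoded

def decode(encoded: dict, schema: dict) -> dict:
    '''Rebuild each array from bytes using the same schema.'''
    return {
        name: np.frombuffer(encoded[name], dtype=spec['dtype']).reshape(spec['shape'])
        for name, spec in schema.items()
    }

data = {
    'seq': np.arange(12).reshape(4, 3),
    'cls_probs': np.full((4, 2), 0.5),
}
restored = decode(encode(data, SCHEMA), SCHEMA)
```

Because both directions read the same SCHEMA, the dtype and shape cannot drift apart between the writer and the reader.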


Original

Setup

import os, sys, json
sys.path.insert(0, '../')
import tensorflow as tf
import numpy as np

Some light helper functions

def list_like_q(value) -> bool:
    '''
    TensorFlow tf.train.Feature requires a list of feature values.
    Many values used in practice are either python lists or numpy.ndarrays.
    We often have features which consist of a singular value.
    For brevity, we define some light helper functions to wrap a list as a
    tf.train.Feature. This lets us test if we need to wrap the value.
    '''
    # import numpy as np
    return (type(value) is list or type(value) is np.ndarray)


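def take_all() -> slice: return slice(None, None, None)
def take_channel(sequence, channel:int, data_format:str='channels_last'):
    # channels_last: the channel index is the last axis  -> sequence[:, channel]
    # otherwise (channels_first): it is the first axis   -> sequence[channel, :]
    slices = [take_all(), channel]
    if data_format != 'channels_last': slices.reverse()
    return sequence[tuple(slices)]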

def number_of_channels(sequence, data_format:str='channels_last') -> int:
    return sequence.shape[-1] if data_format == 'channels_last' else sequence.shape[0]

def feature_int64(value):
    '''Takes value and wraps into tf.train.Feature(Int64List)'''
    if not list_like_q(value): value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def feature_float(value):
    '''Takes value and wraps into tf.train.Feature(FloatList)'''
    if not list_like_q(value): value = [value]
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def feature_bytes(value):
    '''Takes value and wraps it into tf.train.Feature(BytesList).'''
    if type(value) is np.ndarray: value = value.tostring()
    if type(value) is not bytes:  value = str(value).encode('utf-8')
    if type(value) is not list:   value = [value]
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def feature_function(dtype):
    '''
    Given <dtype> returns the function for wrapping a value into the
    corresponding tf.train.Feature
    '''
    return feature_int64 if dtype == "int64" else \
           feature_float if dtype == "float" else \
           feature_bytes

def feature_list(iterable, dtype:str='float'):
    '''Given an iterable, returns the feature list of corresponding <dtype>.'''
    # protobuf messages only accept keyword arguments
    return tf.train.FeatureList(feature=[feature_function(dtype)(item) for item in iterable])


# the next three for completeness
# (feature_list already returns a FeatureList, so it must not be wrapped again)
def feature_list_int64(value):
    return feature_list(value, 'int64')

def feature_list_float(value):
    return feature_list(value, 'float')

def feature_list_bytes(value):
    return feature_list(value, 'bytes')



def dict_to_features(values:dict, types:dict) -> dict:
    '''
    Given <types>, maps over name:dtype pairs and wraps <values>[name] in the
    corresponding  feature type.
    '''
    return {name: feature_function(dtype)(values[name]) for name, dtype in types.items()}

def features_from_dict(values:dict, types:dict):
    return tf.train.Features(feature=dict_to_features(values, types))

def default_channel_names(sequence, data_format:str='channels_last') -> list:
    '''Ensures a naming scheme as required for channel based Example'''
    return [f'Channel {i}' for i in range(number_of_channels(sequence, data_format))]

def channels_to_features(sequence, dtype:str='float', data_format:str='channels_last', channel_names:list=None) -> dict:
    '''
    Given a <sequence> of corresponding <dtype> and <data_format>, with optional <channel_names>
    returns the dictionary of each channel:tf.train.Feature pair.
    '''
    if channel_names is None: channel_names = default_channel_names(sequence, data_format)
    return {
        channel: feature_function(dtype)(take_channel(sequence, i, data_format))
        for i, channel in enumerate(channel_names)
    }

def channels_to_feature_list(sequence, dtype:str='float', data_format:str='channels_last'):
    '''
    Given a <sequence> of <dtype> and <data_format> returns the FeatureList
    where each element corresponds to a channel of <sequence>
    '''
    return tf.train.FeatureList(feature=list(channels_to_features(sequence, dtype, data_format).values()))

SequenceRecords

class SequenceRecord:
    '''
    SequenceRecord is a supporting class built on top of the functions found in
    /model/utils/features.py with the purpose of converting our data consisting
    of:

        - a sequence of length n,
        - n class probability vectors (referred to as pclasses), and
        - metadata (name of sequence, start site, stop site, etc)

    and converting it into a TensorFlow (Sequence)Example which can
    subsequently be written as a TensorFlow Record.

    For both Example and SequenceExample options, the channels / classes of the
    sequence / pclasses can be stored as numeric features (int64 / float) or as
    a byte string. For each of these options, the encoding can be done per
    channel / class, or the entire sequence / pclasses matrix.

    Overwrite the following class variables to suit your needs:

    _class_var           || description
    ---------------------------------------------------------------------------
    _metadata_types:dict || a dictionary of <feature-name>:<dtype> pairs which
                         || is referred to when the metadata is converted into
                         || tf.train.Feature (only 'int64', 'float', 'bytes' are
                         || supported for <dtype>)
    _sequence_data_format|| a string specifying where the channels are. By
                         || default, this is set to 'channels_last'
    _pclasses_data_format|| a string specifying where the channels are (by
                         || default, this is set to 'channels_last')
    _sequence_data_type  || a string specifying what dtype channels should be
                         || encoded as (by default 'int64')
    _pclasses_data_type  || a string specifying what dtype channels should be
                         || encoded as (by default 'float')
    _channel_names       || a list of strings specifying the name and order
                         || channels appear in <sequence> (by default set to
                         || None)
    _classes_names       || a list of strings specifying the name and order
                         || classes appear as channels in <pclasses> (by default
                         || set to None)

    '''
    _metadata_types = {}
    _sequence_data_format = 'channels_last'
    _pclasses_data_format = 'channels_last'
    _sequence_data_type = 'int64'
    _pclasses_data_type = 'float'

    _channel_names = None
    _classes_names = None


    def make_example(self, sequence, pclasses, metadata:dict={}, form:str='example', by:str='channels'):
        '''
        The core function of SequenceRecord. Given <sequence>, <pclasses> and <metadata>
        converts them to the corresponding <form>, encoded <by> the specified schema.

        form: either 'example' (default) or 'sequence' and yields either an
              Example or a SequenceExample.
        by:   either 'channels' (default) or 'bstrings' or 'bdstring' and
              encodes the sequence / pclasses by channel / class as a numeric,
              or a byte string (options 'channels' and 'bstrings'), or dumps the
              entire numpy.ndarray a byte string (option 'bdstring')

        '''
        wrap = self.example if form == 'example' else self.sequence_example
        return wrap(sequence, pclasses, metadata, by)

    def example(self, sequence, pclasses, metadata, by='channels'):
        wrap = self.example_as_channels if by == 'channels' else \
               self.example_as_bdstring if by == 'bdstring' else \
               self.example_as_bstrings
        return wrap(sequence, pclasses, metadata)

    def sequence_example(self, sequence, pclasses, metadata, by='channels'):
        wrap = self.sequence_example_as_channels if by == 'channels' else \
               self.sequence_example_as_bdstring if by == 'bdstring' else \
               self.sequence_example_as_bstrings
        return wrap(sequence, pclasses, metadata)


    def example_as_channels(self, sequence, pclasses, metadata):
        '''
        Encodes each channel (or class) as its own feature with specified dtype
        (e.g. _sequence_data_type) and wraps in tf.train.Example
        '''
        features = {
            **dict_to_features(metadata, self._metadata_types),
            **channels_to_features(sequence, self._sequence_data_type, self._sequence_data_format, self._channel_names),
            **channels_to_features(pclasses, self._pclasses_data_type, self._pclasses_data_format, self._classes_names),
        }
        return tf.train.Example(features=tf.train.Features(feature=features))

    def example_as_bstrings(self, sequence, pclasses, metadata):
        '''
        Encodes each channel (or class) as its own feature but dumps ndarrays
        as byte strings (<np.ndarray.tostring()>) and wraps in tf.train.Example.
        '''
        features = {
            **dict_to_features(metadata, self._metadata_types),
            **channels_to_features(sequence, 'bytes', self._sequence_data_format, self._channel_names),
            **channels_to_features(pclasses, 'bytes', self._pclasses_data_format, self._classes_names),
        }
        return tf.train.Example(features=tf.train.Features(feature=features))

    def example_as_bdstring(self, sequence, pclasses, metadata):
        '''
        Encodes sequence and probability classes as a byte 'dump' string
        i.e. dump the sequence to a string and encode to bytes
        ( equivalent to np.ndarray.tostring() )
        '''
        features = {
            **dict_to_features(metadata, self._metadata_types),
            'sequence': feature_bytes(sequence),
            'pclasses': feature_bytes(pclasses)
        }
        return tf.train.Example(features=tf.train.Features(feature=features))


    def sequence_example_as_channels(self, sequence, pclasses, metadata):
        '''
        Encodes each channel (or class) as its own feature with specified dtype
        (e.g. _sequence_data_type) and wraps in tf.train.SequenceExample
        '''
        context = features_from_dict(metadata, self._metadata_types)
        feat_list = tf.train.FeatureLists(feature_list={
            'sequence': channels_to_feature_list(sequence, self._sequence_data_type, self._sequence_data_format),
            'pclasses': channels_to_feature_list(pclasses, self._pclasses_data_type, self._pclasses_data_format)
        })
        return tf.train.SequenceExample(context=context, feature_lists=feat_list)

    def sequence_example_as_bstrings(self, sequence, pclasses, metadata):
        '''
        Encodes each channel (or class) as its own feature but dumps ndarrays
        as byte strings (<np.ndarray.tostring()>) and wraps in
        tf.train.SequenceExample.
        '''
        context = features_from_dict(metadata, self._metadata_types)
        feat_list = tf.train.FeatureLists(feature_list={
            'sequence': channels_to_feature_list(sequence, 'bytes', self._sequence_data_format),
            'pclasses': channels_to_feature_list(pclasses, 'bytes', self._pclasses_data_format)
        })
        return tf.train.SequenceExample(context=context, feature_lists=feat_list)

    def sequence_example_as_bdstring(self, sequence, pclasses, metadata):
        '''
        Encodes sequence and probability classes as a byte 'dump' string
        i.e. dump the sequence to a string and encode to bytes
        ( equivalent to np.ndarray.tostring() )
        '''
        context = features_from_dict(metadata, self._metadata_types)
        feat_list = tf.train.FeatureLists(feature_list={
            'sequence': tf.train.FeatureList(feature=[feature_bytes(sequence)]),
            'pclasses': tf.train.FeatureList(feature=[feature_bytes(pclasses)])
        })
        return tf.train.SequenceExample(context=context, feature_lists=feat_list)

    def write(self, example, to:str):
        '''
        After calling corresponding method to construct (Sequence)Example,
        writes the passed (Sequence)Example to specified location (full path name).
        '''
        with tf.python_io.TFRecordWriter(to) as writer:
            writer.write(example.SerializeToString())

Dummy data

sequences = np.array([
    # sequence 1
    [
        # el1, el2, el3
        [   1,   1,  1], # channel 1
        [   2,   2,  2], # channel 2
        [   3,   3,  3], # channel 3
    ],
    #sequence 2
    [
        [  10,  10, 10], # channel 1
        [  20,  20, 20], # channel 2
        [  30,  30, 30], # channel 3
    ]
])

pclasses = np.array([
    # sequence 1
    [
        # cls1, cls2, cls3
        [    0,  0.9, 0.1], # class probabilities element 1
        [    0,  0.1, 0.9], # class probabilities element 2
        [  0.8,  0.1, 0.1]  # class probabilities element 3
    ],
    # sequence 2
    [
        # cls1, cls2, cls3
        [  0.8,  0.1, 0.1], # class probabilities element 1
        [    0,  0.1, 0.9], # class probabilities element 2
        [    0,  0.9, 0.1]  # class probabilities element 3
    ]
])


metadata = [
    {'Name': 'sequence 1', 'Val_1': 100, 'Val_2': 10},
    {'Name': 'sequence 2', 'Val_1':  10, 'Val_2': 100}
]

metatypes = {'Name': 'bytes', 'Val_1': 'float', 'Val_2': 'float'}

Initiate and go

SequenceRecord._channel_names = ['Channel 1', 'Channel 2', 'Channel 3']
SequenceRecord._classes_names = ['Class A', 'Class B', 'Class C']
SequenceRecord._metadata_types = metatypes
SR = SequenceRecord()  


SR.make_example(sequences[0], pclasses[0], metadata[0], form='example',  by='channels')
SR.make_example(sequences[0], pclasses[0], metadata[0], form='example',  by='bstrings')
SR.make_example(sequences[0], pclasses[0], metadata[0], form='example',  by='bdstring')
SR.make_example(sequences[0], pclasses[0], metadata[0], form='sequence', by='channels')
SR.make_example(sequences[0], pclasses[0], metadata[0], form='sequence', by='bstrings')
SR.make_example(sequences[0], pclasses[0], metadata[0], form='sequence', by='bdstring')
唯我独甜
#3 · 2019-02-19 23:28

This is an extension to my first answer which some may find useful.

Rather than considering the encoding, here I consider the opposite, i.e. how one retrieves the data from a TFRecord.

The colab can be found here.

In essence I survey 10 ways of encoding an array / array of arrays.

  1. Example: Int64 feature (int array)
  2. Example: Float feature (float array)
  3. Example: Bytes feature (int array dumped to byte string)
  4. SequenceExample: Int64 feature list (array of int arrays)
  5. SequenceExample: Float feature list (array of float arrays)
  6. SequenceExample: Bytes feature list (array of int arrays dumped to byte strings)
  7. Example: Bytes feature (array of int arrays all of which is dumped to byte string)
  8. SequenceExample: Bytes feature list (array of int arrays dumped to byte strings)
  9. SequenceExample: Bytes feature list (array of int arrays all of which is dumped to byte string)
  10. SequenceExample: Bytes feature list (array of int arrays, where each int is dumped to byte string)

There are more ways to do this.

In short, with the exception of method 8, I was able to recover the data (write it to a TFRecord and read it back).

However, it should be noted that for methods 7 and 10, the retrieved array is flattened.
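The flattening is not surprising, since neither a flat value list nor a single byte string records the shape. One possible workaround, sketched here with plain NumPy rather than the TFRecord API, is to store the shape as an extra integer field next to the values. The record layout (a dict with 'values' and 'shape' keys) and the pack() / unpack() helpers are my own illustration, not something taken from the colab.

```python
import numpy as np

def pack(array: np.ndarray) -> dict:
    '''Flatten an array but keep its shape next to the values.'''
    return {'values': array.ravel().tolist(), 'shape': list(array.shape)}

def unpack(record: dict, dtype=np.int64) -> np.ndarray:
    '''Undo pack(): rebuild the array from flat values and the stored shape.'''
    return np.asarray(record['values'], dtype=dtype).reshape(record['shape'])

original = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=np.int64)
record = pack(original)
restored = unpack(record)
```

In a record, the same idea would presumably translate to one additional int64 feature holding the shape, which is then used to reshape after parsing.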
