Using tensorflow's Dataset pipeline, how do I name the elements returned from a map operation?

Posted 2019-04-29 21:32

Question:

I have the map function below (runnable example), which takes a string as input and outputs a string and an integer.

In tf.data.Dataset.from_tensor_slices I named the original input 'filenames'. But when I return the values from the map function map_element_counts, I can only return a tuple (returning a dictionary raises an exception).

Is there a way to name the 2 elements returned from my map_element_counts function?

import tensorflow as tf

filelist = ['fileA_6', 'fileB_10', 'fileC_7']

def map_element_counts(fname):
  # perform operations outside of tensorflow
  return 'test', 10

ds = tf.data.Dataset.from_tensor_slices({'filenames': filelist})
ds = ds.map(map_func=lambda x: tf.py_func(
  func=map_element_counts, inp=[x['filenames']], Tout=[tf.string, tf.int64]
))
element = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
  print(sess.run(element))

Result:

(b'test', 10)

Desired Result:

{'elementA': b'test', 'elementB': 10}

Added detail:

When I do return {'elementA': 'test', 'elementB': 10} I get this exception:

tensorflow.python.framework.errors_impl.UnimplementedError: Unsupported object type dict

Answer 1:

Applying tf.py_func inside ds.map works.

I created a very simple file as an example, which just contains the number 10.

dummy_file.txt:

10

Here is the script:

import tensorflow as tf

filelist = ['dummy_file.txt', 'dummy_file.txt', 'dummy_file.txt']


def py_func(file_contents):
    # perform operations outside of tensorflow: parse the text read from the file
    parsed_txt_file = int(file_contents)
    return 'test', parsed_txt_file


def map_element_counts(fname):
    # let tensorflow read the text file
    file_string = tf.read_file(fname['filenames'])
    # then use python function on the extracted string
    a, b = tf.py_func(
                    func=py_func, inp=[file_string], Tout=[tf.string, tf.int64]
                    )
    return {'elementA': a, 'elementB': b, 'file': fname['filenames']}

ds = tf.data.Dataset.from_tensor_slices({'filenames': filelist})
ds = ds.map(map_element_counts)
element = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(element))
    print(sess.run(element))
    print(sess.run(element))

Output:

{'file': b'dummy_file.txt', 'elementA': b'test', 'elementB': 10}
{'file': b'dummy_file.txt', 'elementA': b'test', 'elementB': 10}
{'file': b'dummy_file.txt', 'elementA': b'test', 'elementB': 10}
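
Since each element is a dictionary, downstream graph code can pick fields out by name. A minimal usage sketch, reusing the element dict from the script above (the doubled value is just for illustration):

# 'elementB' holds the parsed integer, so it can feed ordinary graph ops
doubled = element['elementB'] * 2

with tf.Session() as sess:
    # fetching both tensors in one run() call consumes a single dataset element
    print(sess.run({'file': element['file'], 'doubled': doubled}))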


Answer 2:

I'm posting a final solution to this question for posterity's sake. The code below is a copy/paste example that works under the most complex conditions this question addresses (note that the other two answers aren't copy/pastable code samples):

The goal of the code is:

  • Take a list of (big) files and split it into chunks (filename/index pairs)
  • Process each chunk using a map operation (generators aren't a workable solution here, see: https://github.com/tensorflow/tensorflow/issues/16343)
  • Output multiple samples from a map operation that takes only 1 file/chunk as input.
  • Maintain element naming throughout the process

Copy/pastable working sample for TensorFlow 1.5 / Python 3.x

import tensorflow as tf
import numpy as np

files = [b'testA', b'testB', b'testC']

def mymap1(x):
  # tf.py_func can only return a flat list of tensors, so name them here
  result_tensors = tf.py_func(func=mymap2, inp=[x], Tout=[tf.string, tf.int64])
  return {'filename': result_tensors[0], 'value': result_tensors[1]}

def mymap2(x):
  # plain python: produce 3 samples (filename repeated, 3 values) for each input file
  return np.array([x, x, x]), np.array([10, 20, 30])

def myflatmap(named_elements):
  # split the stacked outputs back into individual, still-named samples
  return tf.data.Dataset.zip({
    'filename': tf.data.Dataset.from_tensor_slices(named_elements['filename']),
    'value': tf.data.Dataset.from_tensor_slices(named_elements['value'])
  })

ds = tf.data.Dataset.from_tensor_slices(files)
ds = ds.map(map_func=mymap1)
ds = ds.flat_map(map_func=myflatmap)

element = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
  for _ in range(9):
    print(sess.run(element))

Output:

{'filename': b'testA', 'value': 10}
{'filename': b'testA', 'value': 20}
{'filename': b'testA', 'value': 30}
{'filename': b'testB', 'value': 10}
{'filename': b'testB', 'value': 20}
{'filename': b'testB', 'value': 30}
{'filename': b'testC', 'value': 10}
{'filename': b'testC', 'value': 20}
{'filename': b'testC', 'value': 30}
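
The named structure survives later tf.data transformations as well. For example, batching keeps the keys and stacks the values; a minimal sketch appended to the pipeline above (the batch size of 3 is arbitrary):

batched = ds.batch(3)
batch_element = batched.make_one_shot_iterator().get_next()

with tf.Session() as sess:
  # each fetched batch is still a dict, with one stacked array per key
  print(sess.run(batch_element))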


Answer 3:

There's no need for tf.py_func in this case, because the map_func of Dataset.map works with dictionaries and other structures:

map_func: A function mapping a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.

Here's an example:

import tensorflow as tf

filelist = ['fileA_6', 'fileB_10', 'fileC_7']

def map_element_counts(fnames):
  return {'elementA': b'test', 'elementB': 10, 'file': fnames['filenames']}

ds = tf.data.Dataset.from_tensor_slices({'filenames': filelist})
ds = ds.map(map_func=map_element_counts)
element = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
  print(sess.run(element))
  print(sess.run(element))
  print(sess.run(element))

Output:

{'elementA': 'test', 'elementB': 10, 'file': 'fileA_6'}
{'elementA': 'test', 'elementB': 10, 'file': 'fileB_10'}
{'elementA': 'test', 'elementB': 10, 'file': 'fileC_7'}
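
You can also confirm the naming without running a session by inspecting the dataset's structure directly; a quick check against the ds built above (note the integer type may show up as int32 or int64 depending on how the constant is converted):

print(ds.output_types)   # e.g. {'elementA': tf.string, 'elementB': tf.int32, 'file': tf.string}
print(ds.output_shapes)  # all scalar shapes in this example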