Is there a way to use tensorflow map_fn on GPU?

I have a tensor A with shape [a,n] and I need to perform an op my_op with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].

In other words: for each subtensor in A (A[0], A[1], ..., A[a-1]) I need to perform an element-wise op with each subtensor in B.

So the resulting tensor would contain the following:

[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b-1] ],
  [ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b-1] ],
  [ ...                                                 ],
  [ A[a-1] op B[0] , A[a-1] op B[1], ... , A[a-1] op B[b-1] ] ]
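To pin down the semantics, here is a slow reference sketch in plain NumPy rather than TensorFlow. The names `pairwise_reference`, `A_small` and `B_small` are made up for illustration, and `op` is assumed to return a scalar for each pair of rows.

```python
import numpy as np

def pairwise_reference(A, B, op):
    """Apply op to every (row of A, row of B) pair; op must return a scalar."""
    a, b = A.shape[0], B.shape[0]
    C = np.empty((a, b), dtype=A.dtype)
    for i in range(a):
        for j in range(b):
            C[i, j] = op(A[i], B[j])
    return C

# Example op: element-wise multiply followed by a sum over n.
A_small = np.arange(6, dtype=np.float32).reshape(2, 3)  # shape [a=2, n=3]
B_small = np.ones((4, 3), dtype=np.float32)             # shape [b=4, n=3]
C_small = pairwise_reference(A_small, B_small, lambda x, y: np.sum(x * y))
print(C_small.shape)  # (2, 4)
```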

The only way I've found to achieve this is through nested use of tf.map_fn, thus:

import tensorflow as tf
import time
import numpy as np

a_size = 64
b_size = 256*256
n = 256
A = tf.placeholder(tf.float32,[a_size,n])
B = tf.placeholder(tf.float32,[b_size,n])

def elementwise_op(a,b):
    return tf.reduce_sum(tf.multiply(a,b))

def intermediate_op(sub_a,my_b):
    sample_values = tf.map_fn(lambda x: elementwise_op(sub_a,x),my_b)
    return sample_values

my_op = tf.map_fn(lambda x: intermediate_op(x,B),A)

with tf.Session() as sess:
    a = np.random.rand(a_size,n)
    b = np.random.rand(b_size,n)
    start_time = time.time()
    result = sess.run(my_op, feed_dict={A: a, B: b})
    print("exec time: ", time.time() - start_time)
    print(result.shape)

The code above runs fine, but it makes poor use of the GPU: nvidia-smi reports only ~15% GPU utilization while one of my CPU cores sits at 100%. In fact, it runs an order of magnitude faster when using only the CPU (on my 12-core machine), where I see 100% utilization across all CPU cores.

Average timing of 5 CPU only runs: 11.33s

Average timing of 5 GPU runs: 111.88s

The above test was run using the official Tensorflow docker images: tensorflow/tensorflow:latest-py3 (for CPU) and tensorflow/tensorflow:latest-gpu-py3 (for GPU)

My guess is that map_fn, via the Python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments on an unanswered SO question here suggest that this is the case.

This article claims that:

lambda expression is the main reason of low GPU utilization.

So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?

Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to get the graph to run on the GPU?

Edit: After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:

node name            | output bytes           | total execution time      | accelerator execution time | cpu execution time
Mul                  | 1.02KB (22.23%, 0.29%) | 195.07ms (85.00%, 13.06%) | 5.29ms (100.00%, 25.79%)   | 189.78ms (84.79%, 12.89%)
Sum                  | 256B (21.41%, 0.07%)   | 241.48ms (69.08%, 16.17%) | 6.01ms (74.21%, 29.29%)    | 235.47ms (69.01%, 15.99%)
TensorArrayScatterV3 | 512B (0.64%, 0.15%)    | 658.31ms (46.87%, 44.09%) | 9.19ms (44.80%, 44.80%)    | 649.12ms (46.90%, 44.08%)

It looks like certain ops are being done mostly on the CPU, and only on one thread at that!

The tf.map_fn() construct can be used with a function that runs ops on GPU. By default, TensorFlow will try to run as much of the function as possible on the GPU, and any GPU-incompatible ops will run on the CPU. In your program, the entire elementwise_op() function is built from GPU-compatible ops, so there should be no additional copying between CPU and GPU at each iteration.

The cause of low GPU utilization is difficult to determine from a program fragment. For example, if A and B are relatively small, and you are feeding them from Python and then immediately fetching back the result, it is likely that the overhead of copying the initial data to and from the GPU would dominate. The best way to track this down is to use a GPU profiler, which you can get using tfprof or the NVIDIA Visual Profiler.
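
As a possible alternative for this particular op: the elementwise_op() in the question (multiply, then reduce_sum) is a dot product, so the nested map_fn appears to compute the same values as a single matrix product of A with the transpose of B. The sketch below checks that equivalence in NumPy on small, made-up sizes (the `_np` names and the sizes are illustrative, not from the question). In TensorFlow the analogous call would be tf.matmul(A, B, transpose_b=True). For an op that is not a dot product, broadcasting to shape [a, b, n] and reducing over the last axis is the general pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
a_size, b_size, n = 4, 5, 3  # small sizes, for checking only
A_np = rng.random((a_size, n)).astype(np.float32)
B_np = rng.random((b_size, n)).astype(np.float32)

# What the nested map_fn computes: sum(A[i] * B[j]) for every pair (i, j).
C_loop = np.array([[np.sum(a_row * b_row) for b_row in B_np] for a_row in A_np])

# The same values as one matrix product, shape [a, b].
C_matmul = A_np @ B_np.T

# General pattern: broadcast to [a, b, n], apply the op, reduce over the last axis.
C_broadcast = np.sum(A_np[:, None, :] * B_np[None, :, :], axis=-1)

print(np.allclose(C_loop, C_matmul, atol=1e-5))
print(np.allclose(C_broadcast, C_matmul, atol=1e-5))
```

Note that the broadcast form materializes an intermediate of a*b*n elements. With the sizes in the question (64 x 65536 x 256 float32 values) that is about 4 GiB, so the matrix-product form is likely the more practical one there.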