I have a tensor A with shape [a,n] and I need to perform an op my_op
with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].
In other words: For each subtensor in A (A[0], A1,...A[n]) I need to perform an element wise op with each subtensor in B.
So the resulting tensor would contain the following:
[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b] ],
[ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b] ],
[ ... ],
[ A[a] op B[0] , A[a] op B[1], ... , A[a] op B[b] ] ]
The only way that I've been able to find that achieves this is through nested use of tf.map_fn Thus:
import tensorflow as tf
import time
import numpy as np
a_size = 64
b_size = 256*256
n = 256
A = tf.placeholder(tf.float32,[a_size,n])
B = tf.placeholder(tf.float32,[b_size,n])
def elementwise_op(a,b):
return tf.reduce_sum(tf.multiply(a,b))
def intermediate_op(sub_a,my_b):
sample_values = tf.map_fn(lambda x: elementwise_op(sub_a,x),my_b)
return sample_values
my_op = tf.map_fn(lambda x: intermediate_op(x,B),A)
with tf.Session() as sess:
a = np.random.rand(a_size,n)
b = np.random.rand(b_size,n)
start_time = time.time()
result = sess.run (my_op,feed_dict={A:a,B:b})
print ("exec time: " ,time.time()-start_time)
print (result.shape)
The code above runs fine, however, it does not use the GPU very well (only ~15% utilization, according to nvidia-smi
). In fact, it runs an order of magnitude faster when using only the CPU!! (on my 12 core machine) When run using the GPU, I see very low GPU utilization (~15%) and 100% on one of my CPU cores. When run on the CPU only, I see 100% utilization across all CPU cores.
Average timing of 5 CPU only runs: 11.33s
Average timing of 5 GPU runs: 111.88s
The above test was run using the official Tensorflow docker images: tensorflow/tensorflow:latest-py3
(for CPU) and tensorflow/tensorflow:latest-gpu-py3
(for GPU)
My guess is that map_fn
, via the python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments in unanswered SO question here suggest that this is the case.
This article claims that:
lambda expression is the main reason of low GPU utilization.
-
So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?
Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to the get graph to run on the GPU?
Edit: After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:
node name | output bytes | total execution time | accelerator execution time | cpu execution time
Mul 1.02KB (22.23%, 0.29%), 195.07ms (85.00%, 13.06%), 5.29ms (100.00%, 25.79%), 189.78ms (84.79%, 12.89%)
Sum 256B (21.41%, 0.07%), 241.48ms (69.08%, 16.17%), 6.01ms (74.21%, 29.29%), 235.47ms (69.01%, 15.99%)
TensorArrayScatterV3 512B (0.64%, 0.15%), 658.31ms (46.87%, 44.09%), 9.19ms (44.80%, 44.80%), 649.12ms (46.90%, 44.08%)
It looks like certain ops are being done mostly on the CPU, and only on one thread at that!