I have extensively studied other answers on TensorFlow and I just cannot seem to get it to use multiple cores on my CPU.
According to htop, the following program only uses a single CPU core:
import tensorflow as tf

n_cpus = 20

sess = tf.Session(config=tf.ConfigProto(
    device_count={ "CPU": n_cpus },
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1,
))

size = 100000

A = tf.ones([size, size], name="A")
B = tf.ones([size, size], name="B")
C = tf.ones([size, size], name="C")

with tf.device("/cpu:0"):
    x = tf.matmul(A, B)
with tf.device("/cpu:1"):
    y = tf.matmul(A, C)

sess.run([x, y])

# run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
# run_metadata = tf.RunMetadata()
# sess.run([x, y], options=run_options, run_metadata=run_metadata)

# for device in run_metadata.step_stats.dev_stats:
#     device_name = device.device
#     print(device.device)
#     for node in device.node_stats:
#         print(" ", node.node_name)
However, when I uncomment the lines at the bottom and change size so that the computation actually finishes in a reasonable amount of time, I see that TensorFlow seems to think it's using at least 2 CPU devices:
/job:localhost/replica:0/task:0/device:CPU:0
    _SOURCE
    MatMul
    _retval_MatMul_0_0
    _retval_MatMul_1_0_1
/job:localhost/replica:0/task:0/device:CPU:1
    _SOURCE
    MatMul_1
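For context on why size has to be reduced before the trace run finishes, here is a rough back-of-the-envelope sketch. It assumes dense storage and 4-byte float32 elements (the default dtype of tf.ones), and it counts roughly 2 * n^3 floating-point operations per matmul. The helper names dense_matrix_bytes and matmul_flops are made up for this illustration and are not TensorFlow functions.

```python
def dense_matrix_bytes(n, bytes_per_element=4):
    """Bytes needed to store one dense n x n matrix (float32 assumed by default)."""
    return n * n * bytes_per_element


def matmul_flops(n):
    """Rough operation count for one dense n x n matmul: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


size = 100000
per_matrix = dense_matrix_bytes(size)

# A, B, C are inputs; x and y are results, so five dense tensors in total.
print("one matrix:   %.1f GB" % (per_matrix / 1e9))
print("five tensors: %.1f GB" % (5 * per_matrix / 1e9))
print("one matmul:   %.1e floating-point operations" % matmul_flops(size))
```

With size = 100000 each tensor alone is tens of gigabytes under these assumptions, so a much smaller size is needed just to see the trace output.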
Fundamentally, what I want to do here is execute different ops on different cores in parallel. I don't want to split a single op over multiple cores, though I know that happens to work in this contrived example. Both device_count and inter_op_parallelism_threads sound like what I want, but neither seems to actually result in using multiple cores. I've tried all combinations I can think of, including setting one or the other to 1 in case they conflict with each other, and nothing seems to work.
I can also confirm with taskset that I'm not doing anything strange with my CPU affinity:
$ taskset -p $$
pid 21395's current affinity mask: ffffffffff
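The same check can be made from inside the Python process itself. This is a minimal sketch, assuming Linux, where os.sched_getaffinity is available; the helper names allowed_cpus and cpus_in_mask are made up for this illustration. It also decodes a hexadecimal mask like the one printed by taskset into a CPU count.

```python
import os


def allowed_cpus():
    """Return the set of CPU indices this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means the calling process
    return None


def cpus_in_mask(hex_mask):
    """Count the set bits in a hexadecimal affinity mask as printed by taskset -p."""
    return bin(int(hex_mask, 16)).count("1")


cpus = allowed_cpus()
if cpus is None:
    print("os.sched_getaffinity is not available on this platform")
else:
    print("allowed CPUs:", sorted(cpus))
    print("count:", len(cpus), "of", os.cpu_count(), "logical CPUs")

# Each set bit in the mask corresponds to one CPU the process may use.
print("CPUs in mask ffffffffff:", cpus_in_mask("ffffffffff"))
```

If the set printed here is smaller than expected, the restriction comes from the process affinity rather than from TensorFlow's configuration.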
What exactly do I have to do to this code to get it to use multiple CPU cores?
Note:
- From this answer among others I'm setting the device_count and inter_op_parallelism_threads.
- The tracing command comes from this answer.
- I can remove the tf.device calls and it doesn't seem to make any difference to my CPU utilization.
I'm using TensorFlow 1.10.0 installed from conda.