Using multiple CPU cores in TensorFlow

I have extensively studied other answers on TensorFlow and I just cannot seem to get it to use multiple cores on my CPU.

According to htop, the following program only uses a single CPU core:

import tensorflow as tf

n_cpus = 20

sess = tf.Session(config=tf.ConfigProto(
    device_count={ "CPU": n_cpus },
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1,
))

size = 100000

A = tf.ones([size, size], name="A")
B = tf.ones([size, size], name="B")
C = tf.ones([size, size], name="C")

with tf.device("/cpu:0"):
    x = tf.matmul(A, B)
with tf.device("/cpu:1"):
    y = tf.matmul(A, C)

sess.run([x, y])

# run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
# run_metadata = tf.RunMetadata()
# sess.run([x, y], options=run_options, run_metadata=run_metadata)

# for device in run_metadata.step_stats.dev_stats:
#     device_name = device.device
#     print(device.device)
#     for node in device.node_stats:
#         print("   ", node.node_name)

However, when I uncomment the lines at the bottom, and change size so that the computation actually finishes in a reasonable amount of time, I see that TensorFlow seems to think it's using at least 2 CPU devices:

/job:localhost/replica:0/task:0/device:CPU:0
    _SOURCE
    MatMul
    _retval_MatMul_0_0
    _retval_MatMul_1_0_1
/job:localhost/replica:0/task:0/device:CPU:1
    _SOURCE
    MatMul_1

Fundamentally, what I want to do here is execute different ops on different cores in parallel. I don't want to split a single op over multiple cores, though I know that happens to work in this contrived example. Both device_count and inter_op_parallelism_threads sound like what I want, but neither seems to actually result in using multiple cores. I've tried all combinations I can think of, including setting one or the other to 1 in case they conflict with each other, and nothing seems to work.

I can also confirm with taskset that I'm not doing anything strange with my CPU affinity:

$ taskset -p $$
pid 21395's current affinity mask: ffffffffff

What exactly do I have to do to this code to get it to use multiple CPU cores?

Note:

From this answer among others I'm setting the device_count and inter_op_parallelism_threads.
The tracing command comes from this answer.
I can remove the tf.device calls and it doesn't seem to make any difference to my CPU utilization.

I'm using TensorFlow 1.10.0 installed from conda.

After some back and forth on the TensorFlow issue here we determined that the issue was that the program was being "optimized" by a constant folding pass, because the inputs were all trivial. It turns out this constant folding pass runs sequentially. Therefore, if you want to observe a parallel execution, the way to do this is to make the inputs non-trivial so that the constant folding won't apply to them. The method suggested in the issue was to use tf.placeholder, and I have written an example program that makes use of this here:

https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8

See the original issue for sample output from the program: https://github.com/tensorflow/tensorflow/issues/22619