I'm currently working on Single Image Super-Resolution, and I've managed to freeze an existing checkpoint file and convert it to TensorFlow Lite. However, when performing inference with the .tflite file, upsampling a single image takes at least 4 times as long as it does when restoring the model from the .ckpt file.
Inference with the .ckpt file is done through session.run(), while inference with the .tflite file is done through interpreter.invoke(). Both operations were run on an Ubuntu 18 VM hosted on a typical PC.
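For reference, the conversion step looked roughly like this. The frozen-graph filename and the input/output array names below are placeholders, not my actual node names:

import tensorflow as tf

# Placeholder file and node names -- substitute the model's real ones.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="frozen_model.pb",
    input_arrays=["x", "x2"],  # low-res input and bicubic-upsampled input
    output_arrays=["y_"],      # super-resolved output
)
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)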
To find out more about the issue, I ran top in a separate terminal to watch the CPU utilization rate while either operation is performed. Utilization hits 270% with the .ckpt file, but stays at around 100% with the .tflite file.
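The same signal can also be captured from inside the script. This is just a sketch assuming psutil is installed; it is not part of my current code:

import psutil

# cpu_percent() measures usage since the previous call; with multiple busy
# threads it can exceed 100%, which is the same convention top uses.
proc = psutil.Process()
proc.cpu_percent(interval=None)  # prime the counter
interpreter.invoke()
print("CPU utilization during invoke: %.0f%%" % proc.cpu_percent(interval=None))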
# Inputs are set first so that only invoke() itself is timed.
interpreter.set_tensor(input_details[0]['index'], input_image_reshaped)
interpreter.set_tensor(input_details[1]['index'], input_bicubic_image_reshaped)
start = time.time()
interpreter.invoke()  # run the super-resolution graph on the .tflite model
end = time.time()
vs
# Single run of the restored .ckpt graph; dropout disabled, inference mode.
y = self.sess.run(
    self.y_,
    feed_dict={
        self.x: image.reshape(1, image.shape[0], image.shape[1], ch),
        self.x2: bicubic_image.reshape(
            1, self.scale * image.shape[0], self.scale * image.shape[1], ch),
        self.dropout: 1.0,
        self.is_training: 0,
    },
)
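To rule out one-off setup costs, the timing could also be averaged over repeated runs, along these lines (bench is a made-up helper, not something from my code):

import time

def bench(fn, warmup=3, iters=20):
    # Warm-up runs exclude one-time costs such as memory allocation.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

print("avg invoke() time: %.3fs" % bench(interpreter.invoke))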
One hypothesis is that TensorFlow Lite is not configured for multithreading; another is that TensorFlow Lite is optimized for ARM processors (rather than the Intel one my computer runs on) and is therefore slower. However, I cannot tell for sure, nor do I know how to trace the root of the issue. Hopefully someone out there is more knowledgeable about this?
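For the multithreading hypothesis specifically, one thing I could try is the num_threads argument that newer TensorFlow versions expose on tf.lite.Interpreter (it may not exist in older builds, so this is a sketch, not something I have verified on my setup):

import tensorflow as tf

# num_threads controls how many threads TFLite kernels may use;
# the argument is only available in newer TensorFlow releases.
interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()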