Does rewriting memcpy/memcmp/... with SIMD instructions make sense in a large scale software?
If so, why doesn't gcc generate SIMD instructions for these library functions by default?
Also, are there any other functions that could be improved with SIMD?
Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library or compiler intrinsics included optimized versions, but that doesn't seem to be pervasive.
I have a custom SIMD memchr which is a hell-of-a-lot faster than the library version, especially when I'm finding the first of 2 or 3 characters (for example, to know whether there's an equation in a line of text, I search for the first of =, \n, \r).
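As a rough illustration, here is a minimal SSE2 sketch of that kind of multi-character search. It is not my production code: the function name, the assumption of a 16-byte-aligned buffer whose length is a multiple of 16, and the use of GCC's __builtin_ctz are all simplifications for brevity.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: return the index of the first '=', '\n' or '\r' in buf,
   or len if none is found. Assumes buf is 16-byte aligned and len is
   a multiple of 16; a real version needs head/tail handling. */
static size_t find_first_of_3(const char *buf, size_t len)
{
    const __m128i eq = _mm_set1_epi8('=');
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i cr = _mm_set1_epi8('\r');

    for (size_t i = 0; i < len; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(buf + i));
        __m128i hits  = _mm_or_si128(
                            _mm_or_si128(_mm_cmpeq_epi8(chunk, eq),
                                         _mm_cmpeq_epi8(chunk, nl)),
                            _mm_cmpeq_epi8(chunk, cr));
        int mask = _mm_movemask_epi8(hits);        /* one bit per matching byte */
        if (mask)
            return i + (size_t)__builtin_ctz(mask); /* lowest set bit = first match */
    }
    return len;
}
```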
On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.
It probably doesn't matter. The CPU can process data far faster than memory can supply it, so memory copies are usually limited by memory bandwidth rather than instruction choice, and the implementations of memcpy etc. provided by the compiler's runtime library are probably good enough. In "large scale" software your performance is not going to be dominated by copying memory anyway (it's more likely dominated by I/O).
To get a real step up in memory copying performance, some systems have a specialised implementation of DMA that can be used to copy from memory to memory. If a substantial performance increase is needed, hardware is the way to get it.
It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.
You may need to explicitly instruct GCC to emit SSE opcodes with e.g. -msse -msse2; some GCC configurations do not enable them by default. Also, if you do not tell GCC to optimize (i.e. -O2), it won't even try to emit fast code.
The use of SIMD opcodes for memory work like this can have a massive performance impact, because the optimized routines also use cache prefetches and non-temporal (streaming) store hints that are important for making good use of the memory bus. But that doesn't mean you need to emit them manually; even though most compilers stink at emitting SIMD ops in general, every one I've used at least handles them for the basic CRT memory functions.
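For instance, a small fixed-size copy is typically expanded inline. This is a hypothetical snippet (the struct and function names are mine); built with gcc -O2 on x86-64, the memcpy call usually becomes a pair of 16-byte SSE load/store instructions rather than a library call:

```c
#include <string.h>

/* 16-byte payload: small enough for GCC to expand the copy inline. */
struct vec4 { float v[4]; };

void copy_vec4(struct vec4 *dst, const struct vec4 *src)
{
    memcpy(dst, src, sizeof *dst);  /* recognized as a builtin and inlined */
}
```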
Basic math functions can also benefit a lot from setting the compiler to SSE mode. You can easily get an 8x speedup on basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU.
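A trivial wrapper makes this easy to check in the generated assembly (the function name here is just for illustration):

```c
#include <math.h>

/* Compiled with -O2 -msse2 -mfpmath=sse (the default on x86-64), GCC emits
   an inline sqrtsd for this call instead of the old x87 fsqrt sequence;
   adding -fno-math-errno also drops the out-of-line fallback kept for errno. */
double my_root(double x)
{
    return sqrt(x);
}
```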
On x86 hardware it should not matter much, thanks to out-of-order execution. The processor will extract the necessary ILP and try to issue the maximum number of load/store operations per cycle for memcpy, whether the code uses SIMD or scalar instructions.