I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I somehow cannot make my compiler generate this code. I hoped (and I vaguely remember seeing so) that using memcpy would do this via a compiler intrinsic, but based on disassembly and debugging it seems the compiler calls the memcpy/memmove library implementation instead. I also hoped the compiler might be smart enough to recognize the following loop and use rep movsd on its own, but it seems it does not.
char *dst;
const char *src;
// ...
for (int r=size; --r>=0; ) *dst++ = *src++;
Is there some way to make the Visual Studio compiler generate a rep movsd sequence, other than using inline assembly?
Several questions come to mind.
First, how do you know movsd would be faster? Have you looked up its latency/throughput? The x86 architecture is full of crufty old instructions that should not be used because they're just not very efficient on modern CPUs.
Second, what happens if you use std::copy instead of memcpy? std::copy is potentially faster, as it can be specialized at compile time for the specific data type.
And third, have you enabled intrinsic functions under project properties -> C/C++ -> Optimization?
Of course I assume other optimizations are enabled as well.
Are you running an optimised build? It won't use an intrinsic unless optimisation is on. It's also worth noting that it will probably use a better copy loop than rep movsd; it should try to use MMX, at the least, to perform a 64-bit-at-a-time copy. In fact, 6 or 7 years back I wrote an MMX-optimised copy loop for doing this sort of thing. Unfortunately the compiler's intrinsic memcpy outperformed my MMX copy by about 1%. That really taught me not to make assumptions about what the compiler is doing.
Using memcpy with a constant size
What I have found meanwhile:
The compiler will use the intrinsic when the copied block size is known at compile time; when it is not, it calls the library implementation. When the size is known, the generated code is very nice and is selected based on the size: it may be a single mov, or movsd, or movsd followed by movsb, as needed.
It seems that if I really want to always use movsb or movsd, even with a "dynamic" size, I will have to use inline assembly or a special intrinsic (see below). I know the size is "quite short", but the compiler does not know it and I cannot communicate this to it - I have even tried __assume(size<16), but it is not enough.
Demo code, compile with -Ob1 (expansion only for functions marked inline):
#include <memory.h>

void MemCpyTest(void *tgt, const void *src, size_t size)
{
    memcpy(tgt, src, size);
}

template <int size>
void MemCpyTestT(void *tgt, const void *src)
{
    memcpy(tgt, src, size);
}

int main(int argc, char **argv)
{
    int src = 0; // initialized so the copies do not read indeterminate memory
    int dst;
    MemCpyTest(&dst, &src, sizeof(dst));
    MemCpyTestT<sizeof(dst)>(&dst, &src);
    return 0;
}
Specialized intrinsics
I have recently found that there is a very simple way to make the Visual Studio compiler copy characters using movsd - very natural and simple: intrinsics. The following intrinsics may come in handy: __movsb, __movsw and __movsd (declared in <intrin.h>).
Have you timed memcpy? On recent versions of Visual Studio, the memcpy implementation uses SSE2, which should be faster than rep movsd. If the block you're copying is 1 KB, then it's not really a problem that the compiler isn't using an intrinsic, since the time for the function call will be negligible compared to the time for the copy.
Note that in order to use movsd, src must point to memory aligned on a 32-bit boundary and its length must be a multiple of 4 bytes. If it is, why does your code use char * instead of int * or something? If it's not, your question is moot.

If you change char * to int *, you might get a better result from std::copy.
Edit: have you measured that the copying is the bottleneck?
Use memcpy. This problem has already been solved.
FYI, rep movsd is not always the best: rep movsb can be faster in some circumstances, and with SSE and the like the best is a non-temporal store such as movntdq [edi], xmm0. You can optimize even further for large amounts of memory by exploiting page locality: move the data to a buffer first, and then move it from there to your destination.