I'm seeing really strange behavior in my x64 multithreaded application: the execution time in debug mode is faster than in release mode.
I broke the problem down and found the cause: the debug build resolves `memcpy` to `memmove` (note: optimization is off!), which performs faster. The release build still uses `memcpy` (note: optimization is on!).
This slows down my multithreaded app in release mode. :(
Does anyone have any idea?
```cpp
#include <time.h>
#include <string.h> // memcpy, memmove
#include <stdio.h>  // printf

#define T_SIZE (1024*1024*2)

// Static storage: ~202 MB of buffers would overflow the stack as locals.
static char data[T_SIZE];
static char store[100][T_SIZE];

int main()
{
    clock_t start;

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memcpy(store[i % 100], data, T_SIZE);
    }
    // Debug: 1040, Release: 1620
    printf("memcpy: %ld\n", (long)(clock() - start));

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memmove(store[i % 100], data, T_SIZE);
    }
    // Debug: 1040, Release: 923
    printf("memmove: %ld\n", (long)(clock() - start));
}
```
Idea: call `memmove`, since it's fastest for your case.

The following answer is valid for VS2013 ONLY.
What we have here is actually stranger than just `memcpy` vs. `memmove`. It's a case of the intrinsic optimization actually slowing things down. The issue stems from the fact that VS2013 inlines `memcpy` as a large copy loop built on unaligned SSE loads and stores, which is actually slower than just using standard C code. I verified this by grabbing the CRT's implementation from the source code included with Visual Studio and turning it into a `my_memcpy`.
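The compiler's actual inline expansion is not reproduced here, but the general shape of an unaligned-SSE copy loop of the kind described above looks roughly like this (an illustrative sketch, not VS2013's real output; `sse_copy` is my name for it):

```cpp
#include <emmintrin.h> // SSE2 intrinsics (x86/x64)
#include <stddef.h>

// Illustrative sketch: copy using unaligned 16-byte SSE loads and stores.
// This is the pattern the intrinsic expansion uses, not its exact code.
void sse_copy(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i)); // unaligned load
        _mm_storeu_si128((__m128i *)(dst + i), v);               // unaligned store
    }
    for (; i < n; i++)  // tail: copy any remaining bytes one at a time
        dst[i] = src[i];
}
```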
As a way of ensuring that the cache was warm during all of this, I had preinitialized all of `data`, but the results were telling: the plain-C CRT version beat the intrinsic. So why is `memmove` faster? Because it doesn't attempt that up-front optimization, since it must assume the data can overlap. For those curious, this is my code in full:
Update
While debugging I found that the compiler did detect that the code I copied from the CRT is `memcpy`, but it links it to the non-intrinsic version in the CRT itself, which uses `rep movs` instead of the massive SSE loop above. It seems the issue is ONLY with the intrinsic version.

Update 2
Per Z boson in the comments, it seems that this is all very architecture dependent. On my CPU `rep movsb` is faster, but on older CPUs the SSE or AVX implementation has the potential to be faster. This is per the Intel Optimization Manual: for unaligned data, `rep movsb` can experience up to a 25% penalty on older hardware. That said, it appears that for the vast majority of cases and architectures, `rep movsb` will on average beat the SSE or AVX implementation.
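For reference, a `rep movsb` copy can be expressed directly with inline assembly on x86-64 GCC/Clang (a sketch; on MSVC the same operation is exposed as the `__movsb` intrinsic in `<intrin.h>`, so this asm form is my portability assumption, not what the CRT ships):

```cpp
#include <stddef.h>

// Copy n bytes with the x86 `rep movsb` string instruction.
// x86-64 GCC/Clang inline asm; MSVC would use __movsb instead.
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n) // rdi, rsi, rcx updated in place
                 :
                 : "memory");
}
```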