Summary:
memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?
Full details:
As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.
I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3 GB/sec. With the memcpy enabled, I am limited to about 550 MB/sec (using my current compiler).
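As a rough sanity check on these figures, the time available per buffer can be worked out directly. This is only a sketch: the helper name `buffer_time_us` is hypothetical, and 3 GB/sec is assumed to mean 3072 MB/sec.

```c
#include <stddef.h>

/* Hypothetical helper: microseconds needed to move one buffer of
   `buffer_mb` megabytes at `rate_mb_per_sec` megabytes per second. */
static double buffer_time_us(double buffer_mb, double rate_mb_per_sec)
{
    return buffer_mb / rate_mb_per_sec * 1e6;
}
```

With 2 MB buffers, `buffer_time_us(2.0, 3072.0)` gives roughly 651 microseconds per buffer at the full data rate, while `buffer_time_us(2.0, 550.0)` gives roughly 3636 microseconds, so at 550 MB/sec the copy alone takes several times longer than the hardware needs to fill the next buffer.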
In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) as well as Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.
Visual C++ 2010: 1900 MB/sec
NI CVI 2009: 550 MB/sec
While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!
What, if anything, can I do to make memcpy faster in this situation?
Hardware details:
AMD Magny Cours, 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64
Test program:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   /* malloc, free, rand */
#include <string.h>   /* memcpy */

const size_t NUM_ELEMENTS = 2 * 1024 * 1024;   /* 2M shorts = 4 MB per buffer */
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;
    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    if (src == NULL || dest == NULL)
        return 1;

    /* Fill the source with data before the timed section starts. */
    for (size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = (unsigned short) rand();
    }

    QueryPerformanceCounter(&start);
    for (size_t iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));
    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;
    double duration_d = (double)duration / (double) frequency.QuadPart;
    /* Total megabytes copied divided by elapsed seconds. */
    double mb_sec = ((double)ITERATIONS * NUM_ELEMENTS * sizeof(unsigned short) / (1024.0 * 1024.0)) / duration_d;

    printf("Duration: %.5fs for %u iterations, %.3fMB/sec\n", duration_d, (unsigned)ITERATIONS, mb_sec);

    free(src);
    free(dest);
    getchar();
    return 0;
}
EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?
You can write a better implementation of memcpy using SSE2 registers. The version in VC2010 does this already, so the real question is whether you are handing it aligned memory.
Maybe you can do better than the VC2010 version, but that takes some understanding of how to do it.
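A minimal sketch of what an SSE2 copy loop could look like, assuming both pointers are 16-byte aligned. The function name `copy_sse2_stream` is hypothetical, and the choice of non-temporal stores (`_mm_stream_si128`) is an assumption of this sketch, not something the VC2010 memcpy is known to do. Whether it actually beats the library memcpy on a given machine has to be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _mm_sfence) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `bytes` bytes using 128-bit SSE2 loads and
   non-temporal stores. Falls back to memcpy if either pointer is not
   16-byte aligned. */
static void copy_sse2_stream(void *dst, const void *src, size_t bytes)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) != 0) {
        memcpy(dst, src, bytes);        /* unaligned: let the library handle it */
        return;
    }

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = bytes / 64;         /* 64 bytes (four registers) per pass */

    for (size_t i = 0; i < blocks; i++) {
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, r0);    /* stores that bypass the cache */
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        s += 4;
        d += 4;
    }
    _mm_sfence();                       /* make the streamed stores visible */

    size_t tail = bytes % 64;           /* leftover bytes, if any */
    if (tail != 0)
        memcpy(d, s, tail);
}
```

In the test program from the question, this would be called in place of memcpy as `copy_sse2_stream(dest, src, NUM_ELEMENTS * sizeof(unsigned short))`, with the buffers allocated on a 16-byte boundary.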
PS: You can pass the buffer to the user-mode program in an inverted call, to avoid the copy altogether.
I'm not sure whether it's selected at run time or has to be enabled at compile time, but you should have SSE or similar extensions enabled, since the vector unit can often write 128 bits to memory at a time, compared to 64 bits for ordinary CPU stores.
Try this implementation. Yeah, and make sure that both the source and destination are aligned to 128 bits. If your source and destination are not aligned relative to each other, your memcpy() will have to do some serious magic. :)
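The two alignment conditions mentioned here can be checked with a couple of small helpers. This is a sketch; the names `is_aligned_16` and `same_relative_alignment` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte (128-bit) boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: returns 1 if a and b have the same offset within a
   16-byte block, i.e. they are aligned relative to each other. */
static int same_relative_alignment(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 15u) == 0;
}
```

On Windows, one way to get buffers that pass the first check is to allocate them with `_aligned_malloc(size, 16)` and release them with `_aligned_free` instead of plain malloc/free.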