We have Core2 machines (Dell T5400) with XP64.
We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or 2.4GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.
My question is: what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?
Thanks for any insight.
Also raised on Intel forums.
An example of a memcpy routine geared specifically for the 64-bit architecture can be found in this article: http://www.godlikemouse.com/2008/03/04/optimizing-memcpy-routines/
Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.
My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.
If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written, making a one-way trip to RAM instead of a round trip.
I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...
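To illustrate the difference, here is roughly what the two kinds of inner loop look like at the intrinsics level. This is only a sketch (not the Intel CRT's actual code), and it assumes 16-byte-aligned buffers and a length that is a multiple of 16 bytes:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>

    // Conventional copy: movdqa stores pull the destination lines into cache.
    void copy_cached(char* dst, const char* src, std::size_t len)
    {
        for (std::size_t i = 0; i < len; i += 16) {
            __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
            _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);
        }
    }

    // Streaming copy: movntdq stores bypass the cache, so the destination is
    // never read -- it makes a one-way trip to memory.
    void copy_streaming(char* dst, const char* src, std::size_t len)
    {
        for (std::size_t i = 0; i < len; i += 16) {
            __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
        }
        _mm_sfence();   // make the non-temporal stores visible before the data is used
    }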
Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.
I don't have a reference in front of me, so I'm not absolutely positive on the timings/instructions, but I can still give the theory. If you're doing a memory move in 32-bit mode, you'll get something like a "rep movsd", which moves a single 32-bit value each iteration. In 64-bit mode you can use "rep movsq", which moves a single 64-bit value each iteration. That instruction isn't available to 32-bit code, so for the same amount of data you'd be doing twice as many movsd iterations, for half the throughput.
VERY much simplified, ignoring all the memory bandwidth/alignment issues, etc, but this is where it all begins...
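For what it's worth, the same idea can be written with the compiler's rep-movs intrinsics. This is just an illustrative sketch using MSVC's __movsd/__movsq from <intrin.h> (as far as I recall, __movsq is only available in x64 builds); it is not what either CRT actually does:

    #include <intrin.h>     // __movsd / __movsq (MSVC)
    #include <cstddef>

    void copy_repmovs(void* dst, const void* src, std::size_t bytes)
    {
    #if defined(_M_X64)
        // rep movsq: 8 bytes per iteration.
        __movsq(static_cast<unsigned __int64*>(dst),
                static_cast<const unsigned __int64*>(src),
                bytes / 8);
    #else
        // 32-bit code has no movsq; rep movsd moves 4 bytes per iteration,
        // so the same buffer needs twice as many iterations.
        __movsd(static_cast<unsigned long*>(dst),
                static_cast<const unsigned long*>(src),
                bytes / 4);
    #endif
    }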
I finally got to the bottom of this (and Die in Sente's answer was on the right lines, thanks).
In the below, dst and src are 512 MByte std::vectors. I'm using the Intel 10.1.029 compiler and CRT.
On 64-bit, both

    memcpy(&dst[0], &src[0], dst.size());

and

    memcpy(&dst[0], &src[0], N);

(where N is previously declared as const size_t N=512*(1<<20);) call the same CRT memcpy routine, the bulk of which is a loop of SSE non-temporal ("movnt") stores, and run at ~2200 MByte/s.

But on 32-bit,

    memcpy(&dst[0], &src[0], dst.size());

calls a CRT memcpy whose bulk is a loop of conventional movdqa loads and stores, and runs at only ~1350 MByte/s.

HOWEVER,

    memcpy(&dst[0], &src[0], N);

(with N again the compile-time constant const size_t N=512*(1<<20);) compiles on 32-bit to a direct call to a vectorized CRT copy routine whose inner loop again uses non-temporal stores, and runs at ~2100 MByte/s (proving 32-bit isn't somehow bandwidth limited).
I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32-bit builds; I now have no problem getting >2 GByte/s on 32-bit or 64-bit. The trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).

It seems a bit strange that the 32-bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version (if you step into memcpy there is the most incredible amount of CPUID checking and heuristic logic, e.g. comparing the number of bytes to be copied with the cache size, before it goes anywhere near your actual data), but at least I understand the observed behaviour now (and it's not SysWow64 or H/W related).
My off-the-cuff guess is that the 64-bit processes are using the processor's native 64-bit memory access size, which optimizes the use of the memory bus.
Thanks for the positive feedback! I think I can partly explain what's going on here.
Using the non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.
On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.
So if you're designing a runtime library and if you can assume that the application that called memcpy is going to use the destination buffer immediately after the memcpy call, then you'll want to provide the movdqa version. This effectively optimizes out the trip from memory back into the CPU that would follow the movntdq version, and all of the instructions following the call will run faster.
But on the other hand, if the destination buffer is large compared to the processor's cache, that optimization doesn't work and the movntdq version would give you faster application benchmarks.
So the ideal memcpy would have multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; when the destination buffer is large compared to the processor's cache, use movntdq. It sounds like this is what's happening in the 32-bit library.
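A hypothetical sketch of that kind of dispatch is below. The threshold value, the function name, and the use of std::memcpy as a stand-in for a movdqa-style cached copy are all purely illustrative -- this is not what any real CRT does -- and the streaming path assumes 16-byte-aligned buffers and a multiple-of-16 length.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstring>
    #include <cstddef>

    // Purely illustrative cut-over point; a real library would derive this
    // from the cache size reported by CPUID.
    static const std::size_t kCacheThreshold = 2 * 1024 * 1024;

    void* smart_memcpy(void* dst, const void* src, std::size_t len)
    {
        if (len < kCacheThreshold) {
            // Small copy: an ordinary cached copy (a real implementation
            // would use a movdqa loop here) leaves the destination hot in
            // cache for the code that follows.
            std::memcpy(dst, src, len);
        } else {
            // Large copy: stream past the cache to avoid polluting it and
            // to save the unnecessary read of the destination.
            char* d = static_cast<char*>(dst);
            const char* s = static_cast<const char*>(src);
            for (std::size_t i = 0; i < len; i += 16) {
                __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i));
                _mm_stream_si128(reinterpret_cast<__m128i*>(d + i), v);
            }
            _mm_sfence();
        }
        return dst;
    }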
Of course, none of this has anything to do with the differences between 32-bit and 64-bit.
My conjecture is that the 64-bit library just isn't as mature. The developers just haven't gotten around to providing both routines in that version of the library yet.