I've become interested in writing a memcpy()
as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's
some guy's implementation:
__forceinline // Since Size is usually known,
// most useless code will be optimized out
// if the function is inlined.
void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
void* start = Dst;
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst
switch (Size) {
case 0x00: break;
case 0x01: CPY_1B; break;
case 0x02: CPY_2B; break;
case 0x03: CPY_1B; CPY_2B; break;
case 0x04: CPY_4B; break;
case 0x05: CPY_1B; CPY_4B; break;
case 0x06: CPY_2B; CPY_4B; break;
case 0x07: CPY_1B; CPY_2B; CPY_4B; break;
case 0x08: CPY_8B; break;
case 0x09: CPY_1B; CPY_8B; break;
case 0x0A: CPY_2B; CPY_8B; break;
case 0x0B: CPY_1B; CPY_2B; CPY_8B; break;
case 0x0C: CPY_4B; CPY_8B; break;
case 0x0D: CPY_1B; CPY_4B; CPY_8B; break;
case 0x0E: CPY_2B; CPY_4B; CPY_8B; break;
case 0x0F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; break;
case 0x10: CPY16B; break;
case 0x11: CPY_1B; CPY16B; break;
case 0x12: CPY_2B; CPY16B; break;
case 0x13: CPY_1B; CPY_2B; CPY16B; break;
case 0x14: CPY_4B; CPY16B; break;
case 0x15: CPY_1B; CPY_4B; CPY16B; break;
case 0x16: CPY_2B; CPY_4B; CPY16B; break;
case 0x17: CPY_1B; CPY_2B; CPY_4B; CPY16B; break;
case 0x18: CPY_8B; CPY16B; break;
case 0x19: CPY_1B; CPY_8B; CPY16B; break;
case 0x1A: CPY_2B; CPY_8B; CPY16B; break;
case 0x1B: CPY_1B; CPY_2B; CPY_8B; CPY16B; break;
case 0x1C: CPY_4B; CPY_8B; CPY16B; break;
case 0x1D: CPY_1B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1E: CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
}
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
return start;
}
The comment translates as "Size is usually known as the compiler can optimize the code inline out most useless".
I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see it uses SSE/AVX for the larger chunks of memory, then instead of a loop over the last < 32 bytes does the equivalent of manual unrolling, with some tweaking. So, here are my questions:
- Why unroll the loop for the last several bytes, but not partially unroll the first (and now single) loop?
- What about alignment issues? Aren't they important? Should I handle the first several bytes up to some alignment quantum differently, then perform the 256-bit ops on aligned sequences of bytes? And if so, how do I determine the appropriate alignment quantum?
- What's the most important missing feature in this implementation (if any)?
Features/Principles mentioned in the answers so far
- You should
__restrict__
your parameters. (@chux) - The memory bandwidth is a limiting factor; measure your implementation against it.(@Zboson)
- For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (@Zboson)
- Multiple threads (may be | are) necessary to saturate the memory bandwidth. (@Zboson)
- It is probably wise to optimize differently for large and small copy sizes. (@Zboson)
- (Alignment is important? Not explicitly addressed!)
- The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (@chux)
- There are arguments for unrolling your SSE/AVX calls (@BenJackson, here), and arguments against doing so (@PaulR)
- non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (@Zboson)
I have been studying measuring memory bandwidth for Intel processors with various operations and one of them is
memcpy
. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).To write your own efficient
memcpy
function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficientmemcpy
function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.For small arrays the maximum read and write bandwidth for each core is:
This is the benchmark you should aim for small arrays. For my tests I assume the arrays are aligned to 64-bytes and that the array size is a multiple of
8*sizeof(float)*unroll_factor
. Here are my currentmemcpy
results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):The
asmlib
is Agner Fog's asmlib. Thecopy_unroll1
andcopy_unroll8
functions are defined below.From this table we can see that the GCC builtin
memcpy
does not work well on Core2 and thatmemcpy
in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.}
Where
VECNF().LOAD
is_mm_load_ps()
for SSE or_mm256_load_ps()
for AVX,VECNF().STORE
is_mm_store_ps()
for SSE or_mm256_store_ps()
for AVX, and JUMP is 4 for SSE or 8 for AVX.For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe a single thread does NOT usually saturate the memory bandwidth.
Where
stream
is_mm_stream_ps()
for SSE or_mm256_stream_ps()
for AVXHere are the
memcpy
results on my E5-1620@3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.
I modfied the
eglibc
andasmlib
memcpy
functions to run in parallel like thisA general
memcpy
function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But theses are "second order" details which I think should be dealt with after optimizing for ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.Update
Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use
rep movsb
thanmovntdqa
(a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization manuals in the section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB).Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:
"There are several ways of moving large blocks of data. The most common methods are:
A general
memcpy
should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficientmemcpy
function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtinmemcpy
or the one in EGLIBC so the assumption that you can't do better than the standard libraries is incorrect.The question can't be answered precisely without some additional details such as:
Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.
32-case Switch Statement
The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.
Code Size
This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of
memcpy
on a full-sized CPU because everything still fit in the fastest cache level: but in the real world you execute other code too and there is contention for the uop cache and L1 data and instruction caches.That many instructions may take fully 20% of the effective size of your uop cache3, and uop cache misses (and the corresponding cache-to-legacy encoder transition cycles) could easily wipe the small benefit given by this elaborate switch.
On top of that, the switch requires a 32-entry, 256 byte lookup table for the jump targets4. If you ever get a miss to DRAM on that lookup, you are talking a penalty of 150+ cycles: how many non-misses do you need to then to make the
switch
worth it, given it's probably saving a few or two at the most? Again, that won't show up in a microbenchmark.For what its worth, this
memcpy
isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I can conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.Branch Prediction
The switch statement relies on a single indirect branch to choose among the alternatives. This going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.
Because it is an indirect branch, there are more limits on the predictability of the branch than a conditional branch since there are a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to
memcpy
don't follow a simple repeating pattern of a short period (as short as 1 or 2 on older CPUs), there will be a branch-mispredict on each call.This issue is particularly insidious because it is likely to hurt you the most in real-world in exactly the situations where a microbenchmark shows the
switch
to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, theswitch
is all-important (indeed, for copies of 31 bytes or less it is all that executes)!For these short lengths, a predictable series of lengths works very well for the
switch
since the indirect jump is basically free. In particular, a typicalmemcpy
benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test to report the results for easy graphing of "time vs length" graphs. Theswitch
does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.In the real world, your lengths might be small but unpredicable. In that case, the indirect branch will frequently mispredict5, with a penalty of ~20 cycles on modern CPUs. Compared to best case of a couple cycles it is an order of magnitude worse. So the glass jaw here can be very serious (i.e., the behavior of the
switch
in this typical case can be an order of magnitude worse than the best, whereas at long lengths, you are usually looking at a difference of 50% at most between different strategies).Solutions
So how can you do better than the above, at least under the conditions where the
switch
falls apart?Use Duff's Device
One solution to the code size issue is to combine the switch cases together, duff's device-style.
For example, the assembled code for the length 1, 3 and 7 cases looks like:
Length 1
Length 3
Length 7
This can combined into a single case, with various jump-ins:
The labels don't cost anything, and they combine the cases together and removes two out of 3
ret
instructions. Note that the basis forrsi
andrcx
have changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.You can extend that for longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction alone from this approach, and much better if you combine it with something else to collapse the sizes from 16 - 31.
This approach only helps with the code size (and possibly the jump table size, if you shrink the size as described in 4 and you get under 256 bytes, allowing a byte-sized lookup table. It does nothing for predictability.
Overlapping Stores
One trick that helps for both code size and predictability is to use overlapping stores. That is,
memcpy
of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative position0
and11 - 8 == 3
. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.The C code looks like:
... and the corresponding assembly is not problematic:
In particular, note that you get exactly two loads, two stores and one
and
(in addition to thecmp
andjmp
whose existence depends on how you organize the surrounding code). That's already tied or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty6. This has two main advantages:
The behavior is branch free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.
At least 8 or 9 different cases from the
switch
are subsumed into a single case with a fraction of the total code size.This approach can be combined with the
switch
approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
Alignment
The existing code doesn't address alignment.
In fact, it isn't, in general, legal or C or C++, since the
char *
pointers are simply casted to larger types and dereferenced, which is not legal - although in practice it generates codes that works on today's x86 compilers (but in fact would fail for platform with stricter alignment requirements).Beyond that, it is often better to handle the alignment specifically. There are three main cases:
A
that can be added to both the source and destination such that both are aligned.The existing algorithm will work ok in case (1). It is potentially missing a large optimization the case of (2) since small intro loop could turn an unaligned copy into an aligned one.
It is also likely performing poorly in case (3), since in general in the totally misaligned case you can chose to either align the destination or the source and then proceed "semi-aligned".
The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.
If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
No unrolling
The code is not unrolled and compilers unrolled by different amounts by default. Clearly this is suboptimal since among two compilers with different unroll strategies, at most one will be best.
The best approach (at least for known platform targets) is determine which unroll factor is best, and then apply that in the code.
Furthermore, the unrolling can often be combined in a smart way with the "intro" our "outro" code, doing a better job than the compiler could.
Known sizes
The primary reason that it is tough to beat the "builtin"
memcpy
routine with modern compilers is that compilers don't just call a librarymemcpy
whenevermemcpy
appears in the source. They know the contract ofmemcpy
and are free to implement it with a single inlined instruction, or even less7, in the right scenario.This is especially obvious with known lengths in
memcpy
. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in-place. This not only avoids the overhead of the function call, but all the checks about size and so on - and also generates at compile time efficient code for the copy, much like the bigswitch
in the implementation above - but without the costs of theswitch
.Similarly, the compiler knows a lot of about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.
If you just implement a
memcpy2
as a library function, that is tough to replicate. You can get part of the way there my splitting the method into a small and big part: the small part appears in the header file, and does some size checks and potentially just calls the existingmemcpy
if the size is small or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtinmemcpy
.Finally, you can also try tricks with
__builtin_constant_p
or equivalents to handle the small, known case efficiently.1 Note that I'm drawing a distinction here between the "distribution" of sizes - e.g., you might say _uniformly distributed between 8 and 24 bytes - and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predicable pattern)? The question of predictability somewhat subtle because it depends on on the implementation, since as described above certain implementations are inherently more predictable.
2 In particular, ~750 bytes of instructions in
clang
and ~600 bytes ingcc
for the body alone, on top of the 256-byte jump lookup table for the switch body which had 180 - 250 instructions (gcc
andclang
respectively). Godbolt link.3 Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.
4 The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table at the cost of not using memory-source for the
jmp
, cutting its size by 75%.5 Unlike conditional branch prediction, which has a typical worst-case prediction rate of ~50% (for totally random branches), a hard-to-predict indirect branch can easily approach 100% since you aren't flipping a coin, you are choosing for an almost infinite set of branch targets. This happens in the real-world: if
memcpy
is being used to copy small strings with lengths uniformly distributed between 0 and 30, theswitch
code will mispredict ~97% of the time.6 Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.
7 For example, a
memcpy
to the stack, followed by some manipulation and a copy somewhere else may be totally eliminated, directly moving the original data to its final location. Even things likemalloc
followed bymemcpy
can be totally eliminated.Taking Benefits of The ERMSB
Please also consider using REP MOVSB for larger blocks.
As you know, since first Pentium CPU produced in 1993, Intel began to make simple commands faster and complex commands (like REP MOVSB) slower. So, REP MOVSB became very slow, and there was no more reason to use it. In 2013, Intel decided to revisit REP MOVSB. If the CPU has CPUID ERMSB (Enhanced REP MOVSB) bit, then REP MOVSB commands are executed differently than on older processors, and are supposed to be fast. On practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Intel recommends using AVX for blocks smaller than 2048 bytes. For the larger blocks, Intel recommends using REP MOVSB. This is because high initial startup costs of REP MOVSB (about 35 cycles).
I have done speed tests, and for the blocks of than 2048 bytes and higher, the performance of REP MOVSB is unbeatable. However, for blocks smaller than 256 bytes, REP MOVSB is very slow, even slower than plain MOV RAX back and forth in a loop.
Please not that ERMSB only affects MOVSB, not MOVSD (MOVSQ), so MOVSB is little bit faster than MOVSD (MOVSQ).
So, you can use AVX for your memcpy() implementation, and if the block is larger than 2048 bytes and all the conditions are met, then call REP MOVSB - so your memcpy() implementation will be unbeatable.
Taking Benefits of The Out-of-Order Execution Engine
You can also read about The Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2, and take benefits of it.
For example, in Intel SkyLake processor series (launched in 2015), it has:
So we can occupy above units (3+4) in parallel if we use register-only operations. We cannot use 3+4 instructions in parallel for memory copy. We can use simultaneously maximum of up to two 32-bytes instructions to load from memory and one 32-bytes instructions to store from memory, and even if we are working with Level-1 cache.
Please see the Intel manual again to understand on how to do the fastest memcpy implementation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.2 (The Out-of-Order Engine of the Haswelll microarchitecture): "The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provided execution resources for computational operations. The other 4 ports support memory operations of up to two 256-bit load and one 256-bit store operation in a cycle."
Section 2.2.4 (Cache and Memory Subsystem) has the following note: "First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32-bytes of data."
Section 2.2.4.1 (Load and Store Operation Enhancements) has the following information: The L1 data cache can handle two 256-bit (32 bytes) load and one 256-bit (32 bytes) store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.
The other sections (2.3 and so on, dedicated to Sandy Bridge and other microarchitectures) basically reiterate the above information.
The section 2.3.4 (The Execution Core) gives additional details.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.
The section 2.3.5.1 (Load and Store Operation Overview) may also be useful to understand on how to make fast memory copy, as well as the section 2.4.4.1 (Loads and Stores).
For the other processor architectures, it is again - two load units and one store unit. Table 2-4 (Cache Parameters of the Skylake Microarchitecture) has the following information:
Peak Bandwidth (bytes/cyc):
I have also done speed tests on my Intel Core i5 6600 CPU (Skylake, 14nm, released in September 2015) with DDR4 memory, and this has confirmed the teory. For example, my test have shown that using generic 64-bit registers for memory copy, even many registers in parallel, degrades performance. Also, using just 2 XMM registers is enough - adding the 3rd doesn't add performance.
If your CPU has AVX CPUID bit, you may take benefits of the large, 256-bit (32 byte) YMM registers to copy memory, to occupy two full load units. The AVX support was first introduced by Intel with the Sandy Bridge processors, shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011.
Also, there is speed benefit if you loop-unroll this code at least 8 times. As I wrote before, adding more registers besides ymm0 and ymm1 doesn't increase performance, because there are just two load units and one store unit. Adding loops like "dec r9 jnz @@again" degrades the performance, but simple "add ecx/edx" does not.
Finally, if your CPU has AVX-512 extension, you can use 512-bit (64-byte) registers to copy memory:
AVX-512 is supported by the following processors: Xeon Phi x200, released in 2016; Skylake EP/EX Xeon "Purley" (Xeon E5-26xx V5) processors (H2 2017); Cannonlake processors (H2 2017), Skylake-X processors - Core i9-7×××X, i7-7×××X, i5-7×××X - released on June 2017.
Please note that the memory have to be aligned on the size of the registers that you are using. If it is not, please use "unaligned" instructions: vmovdqu and moveups.
Firstly the main loop uses unaligned AVX vector loads/stores to copy 32 bytes at a time, until there are < 32 bytes left to copy:
Then the final switch statement handles the residual 0..31 bytes in as efficient manner as possible, using a combination of 8/4/2/1 byte copies as appropriate. Note that this is not an unrolled loop - it's just 32 different optimised code paths which handle the residual bytes using the minimum number of loads and stores.
As for why the main 32 byte AVX loop is not manually unrolled - there are several possible reasons for this:
[*] note that the last two comments above apply to cases where source and/or destination are not in cache (i.e. writing/reading to/from DRAM), and therefore load/store latency is high.