I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.
The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE
static inline void *__movsb(void *d, const void *s, size_t n) {
    asm volatile ("rep movsb"
                  : "=D" (d),
                    "=S" (s),
                    "=c" (n)
                  : "0" (d),
                    "1" (s),
                    "2" (n)
                  : "memory");
    return d;
}
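For comparison, here is a minimal sketch of the same wrapper written with read-write ("+") constraints, which is a common way to express operands that the instruction both reads and updates. The name movsb_copy is hypothetical, and the memcpy fallback for non-x86 targets is an assumption added only so the sketch compiles everywhere. One difference worth noting: after rep movsb the destination register points past the copied block, so this version saves the original pointer and returns it, as memcpy does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst with REP MOVSB and return the original dst.
   Relies on the direction flag being clear, which the ABI guarantees at
   function boundaries. movsb_copy is an illustrative name. */
static inline void *movsb_copy(void *dst, const void *src, size_t n) {
    void *ret = dst;                       /* rep movsb advances dst, so save it */
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)  /* read and updated */
                      :
                      : "memory");
#else
    memcpy(dst, src, n);                   /* assumption: portable fallback */
#endif
    return ret;
}
```

Whether this changes the measured bandwidth at all is a separate question; the constraints only affect how the compiler sees the operands, not what the instruction does.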
When I use this, however, the bandwidth is much lower than with memcpy. __movsb gets 15 GB/s and memcpy gets 26 GB/s on my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4@2400 MHz dual channel 32 GB, GCC 6.2.
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
Here is the code I used to test this.
//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>

static inline void *__movsb(void *d, const void *s, size_t n) {
    asm volatile ("rep movsb"
                  : "=D" (d),
                    "=S" (s),
                    "=c" (n)
                  : "0" (d),
                    "1" (s),
                    "2" (n)
                  : "memory");
    return d;
}

int main(void) {
    int n = 1<<30;

    //char *a = malloc(n), *b = malloc(n);
    char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
    memset(a,2,n), memset(b,1,n);

    // Warm-up copy and correctness check (prints 0 when the buffers match).
    __movsb(b,a,n);
    printf("%d\n", memcmp(b,a,n));

    double dtime;

    // The bandwidth figure counts bytes read plus bytes written, hence the 2.0 factor.
    dtime = -omp_get_wtime();
    for(int i=0; i<10; i++) __movsb(b,a,n);
    dtime += omp_get_wtime();
    printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);

    dtime = -omp_get_wtime();
    for(int i=0; i<10; i++) memcpy(b,a,n);
    dtime += omp_get_wtime();
    printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
}
The reason I am interested in rep movsb is based on these comments:
Note that on Ivybridge and Haswell, with buffers to large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not... rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)
What's missing/sub-optimal in this memcpy implementation?
Here are my results on the same system from tinymembench.
C copy backwards : 7910.6 MB/s (1.4%)
C copy backwards (32 byte blocks) : 7696.6 MB/s (0.9%)
C copy backwards (64 byte blocks) : 7679.5 MB/s (0.7%)
C copy : 8811.0 MB/s (1.2%)
C copy prefetched (32 bytes step) : 9328.4 MB/s (0.5%)
C copy prefetched (64 bytes step) : 9355.1 MB/s (0.6%)
C 2-pass copy : 6474.3 MB/s (1.3%)
C 2-pass copy prefetched (32 bytes step) : 7072.9 MB/s (1.2%)
C 2-pass copy prefetched (64 bytes step) : 7065.2 MB/s (0.8%)
C fill : 14426.0 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 14198.0 MB/s (1.1%)
C fill (shuffle within 32 byte blocks) : 14422.0 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 14178.3 MB/s (1.0%)
---
standard memcpy : 12784.4 MB/s (1.9%)
standard memset : 30630.3 MB/s (1.1%)
---
MOVSB copy : 8712.0 MB/s (2.0%)
MOVSD copy : 8712.7 MB/s (1.9%)
SSE2 copy : 8952.2 MB/s (0.7%)
SSE2 nontemporal copy : 12538.2 MB/s (0.8%)
SSE2 copy prefetched (32 bytes step) : 9553.6 MB/s (0.8%)
SSE2 copy prefetched (64 bytes step) : 9458.5 MB/s (0.5%)
SSE2 nontemporal copy prefetched (32 bytes step) : 13103.2 MB/s (0.7%)
SSE2 nontemporal copy prefetched (64 bytes step) : 13179.1 MB/s (0.9%)
SSE2 2-pass copy : 7250.6 MB/s (0.7%)
SSE2 2-pass copy prefetched (32 bytes step) : 7437.8 MB/s (0.6%)
SSE2 2-pass copy prefetched (64 bytes step) : 7498.2 MB/s (0.9%)
SSE2 2-pass nontemporal copy : 3776.6 MB/s (1.4%)
SSE2 fill : 14701.3 MB/s (1.6%)
SSE2 nontemporal fill : 34188.3 MB/s (0.8%)
Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.
In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.
When I do
sudo cpufreq-set -r -g performance
I sometimes see over 20 GB/s with rep movsb.
With
sudo cpufreq-set -r -g powersave
the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.
I checked the frequency (using turbostat) with and without SpeedStep enabled, with performance and with powersave, for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).
             SpeedStep   idle   1 core   4 core
powersave    OFF         0.8    2.6      2.6
performance  OFF         2.6    2.6      2.6
powersave    ON          0.8    3.5      3.1
performance  ON          3.5    3.5      3.1
This shows that with powersave, even with SpeedStep disabled, the CPU still clocks down to the idle frequency of 0.8 GHz. It's only with performance without SpeedStep that the CPU runs at a constant frequency.
I used e.g. sudo cpufreq-set -r performance (because cpufreq-set was giving strange results) to change the power settings. This turns turbo back on, so I had to disable turbo afterwards.
As a general memcpy() guide:
a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: The compiler can use normal mov instructions and avoid the startup overheads.
b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.
c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb. Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.
d) For all other cases use something like this:
Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.
This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.
In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.
(I only wish there was a memrepeat(), the opposite of memcpy() compared to memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)
I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out if I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.
The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src, dst, n triplets, where all src to src+n-1 and dst to dst+n-1 are completely within the current area.
A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with original data in current, applying all the triplets in the current set using memcpy() provided by the C library, and copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.
Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, median makes the most sense in benchmarking, and provides sensible semantics -- the function is at least that fast at least half the time.)
To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) -- note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).
Unfortunately, I have not yet found any set where the rep movsb variant would beat the memcpy() variant using gcc -Wall -O2 -march=ivybridge -mtune=ivybridge using GCC 5.4.0 on the aforementioned Core i5-6200U laptop running a linux-4.4.0 64-bit kernel. Copying 4096-byte aligned and sized chunks comes close, however.
This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.
(At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)
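Since the sources are not published, the following is only a rough sketch of the timing scheme described above (call the same function many times, compare medians). All names here (copy_fn, median, median_copy_time, copy_with_memcpy) are hypothetical, the use of clock_gettime with CLOCK_MONOTONIC is an assumption, and the dynamic loading and verification steps are left out.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Same shape as the benchmarked functions in the text: returns nothing. */
typedef void (*copy_fn)(void *, const void *, size_t);

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static double median(double *v, size_t n) {
    assert(n > 0);
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Time `runs` calls of fn(dst, src, len); return the median time in seconds. */
static double median_copy_time(copy_fn fn, void *dst, const void *src,
                               size_t len, size_t runs) {
    assert(runs > 0);
    double *t = malloc(runs * sizeof *t);
    assert(t != NULL);
    for (size_t i = 0; i < runs; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        fn(dst, src, len);
        clock_gettime(CLOCK_MONOTONIC, &b);
        t[i] = (double)(b.tv_sec - a.tv_sec)
             + 1e-9 * (double)(b.tv_nsec - a.tv_nsec);
    }
    double m = median(t, runs);
    free(t);
    return m;
}

/* Example candidate: the C library memcpy wrapped in the benchmark signature. */
static void copy_with_memcpy(void *d, const void *s, size_t n) {
    memcpy(d, s, n);
}
```

A rep movsb candidate would be wrapped in the same signature and passed to median_copy_time with identical buffers, so that only the copy routine differs between runs.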
This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and whether the number of bytes to copy is a compile-time constant, a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.
GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy()/memset()/memmove() has already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.
GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least the value passed as its n-th argument. An application or a library can choose which implementation of a function to use at run time, by creating a "resolver function" (that returns a function pointer), and defining the function using the ifunc (resolver) attribute.
One of the most common patterns I use in my code for this is
where ptr is some pointer, alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.
Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
There are far more efficient ways to move data. These days, the implementation of memcpy will generate architecture specific code from the compiler that is optimized based upon the memory alignment of the data and other factors. This allows better use of non-temporal cache instructions and XMM and other registers in the x86 world.
Hard-coding rep movsb prevents this use of intrinsics.
Therefore, for something like a memcpy, unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C level intrinsics), you are far better off allowing the compiler to figure it out for you.
You say that you want:
But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:
So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as bad as it has in some previous CPUs.
However, just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.
Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the cpu's cache), sometimes the 'benefit' of AVX et al is going to be offset by the impact it has on the rest of your code. Depends on what you are doing.
You also ask:
It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.
If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.
Or you can just... Well, you asked us not to say it.
This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.
Partly, this is a call to share results - if you can run Tinymembench and share the results along with details of your CPU and RAM configuration it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.
History and Official Advice
The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them into line with, or even made them faster than, competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.
For example, in guides preceding the introduction of Ivy Bridge, the typical advice is to avoid them or use them very carefully [1].
The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as [2]:
So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with 2 explicit byte, word movs if you know the size is exactly three).
They go on to say:
This certainly seems wrong on current hardware with ERMSB where rep movsb is at least as fast, or faster, than the movd or movq variants for large copies.
In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.
They then go on to cover ERMSB explicitly in section 3.7.6.
I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.
Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.
Technical Considerations
This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.
Advantages for rep movs
When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
Hardware prefetching can pick up memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Apparently, there is no guarantee of ordering among the stores within [3] a single rep movs, which can help simplify coherency traffic and simplify other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering [4].
In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes [5] and rep movs could use that internally.
Disadvantages
rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.
The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks [6] and so can avoid much of the startup work that rep movs must do on every invocation.
Test Results
Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):
Some key takeaways:
The rep movs methods are faster than all the other methods which aren't "non-temporal" [7], and considerably faster than the "C" approaches which copy 8 bytes at a time.
... rep movs ones - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
... memcpy) but it probably doesn't matter due to the above note.
... rep movs approaches lie in the middle.
rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
Haswell
Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):
The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.
When should you use rep movs?
Finally a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.
Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.
Restrictions on available instructions
In some environments there is a restriction on certain instructions or on using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.
A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...
Future Proofing
A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.
So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).
How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and on the very difficult judgement of how fast these instructions are likely to be in the future. At least Intel is kind of guiding uses in this direction though, by committing to at least reasonable performance in the future (15.3.3.6):
Overlapping with subsequent work
This benefit won't show up in a plain memcpy benchmark of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.
This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:
So Intel is saying that once the uops of rep movsb have all issued, but while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.
The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.
There doesn't seem to be much detailed information about how very long microcoded instructions like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?
When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute [8] while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet. This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).
Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).
As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.
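The arithmetic behind the ~800 byte and ~1% figures can be spelled out as a tiny helper. This is only a back-of-the-envelope sketch: overlap_bytes is a hypothetical name, and the 2.6 GHz clock used in the usage note is an assumption taken from the test machine mentioned earlier, not something the estimate above states explicitly.

```c
#include <assert.h>

/* Bytes a copy running at `gbps` GB/s moves while the core works through
   `rob_uops` younger uops at `ipc` uops per cycle on a `ghz` GHz clock. */
static double overlap_bytes(double rob_uops, double ipc, double ghz, double gbps) {
    assert(ipc > 0.0 && ghz > 0.0);
    double cycles = rob_uops / ipc;   /* cycles of overlapped "free" work */
    double ns = cycles / ghz;         /* 1 GHz = 1 cycle per nanosecond   */
    return ns * gbps;                 /* 1 GB/s = 1 byte per nanosecond   */
}
```

With overlap_bytes(200, 1, 2.6, 10) the result is about 770 bytes, in line with the "around 800 bytes" above, and that is roughly 1% of an 80 KB copy.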
Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kind of copies or surrounding code there is most benefit. (Hot or cold destination or source, high ILP or low ILP high-latency code after).
Code Size
The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.
Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.
Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.
Architecture Specific Optimization
You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).
That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.
Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:
In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy, but for statically compiled binaries this does require platform specific dispatch, but that often already exists (sometimes implemented at link time), or the mtune argument can be used to make a static decision.
Simplicity
Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).
Latency Bound Platforms
Memory throughput bound algorithms [9] can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.
The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretical bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.
You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).
The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.
In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
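The two limits can be written down as one-line formulas. The sketch below is illustrative only: the function names are hypothetical, and the inputs in the usage note (2.133 GT/s, 8 bytes per transfer, 2 channels, 50 ns latency, 10 outstanding lines) are the example numbers from the paragraphs above rather than measurements.

```c
#include <assert.h>

/* Peak DRAM bandwidth in GB/s:
   transfers per ns (GT/s) x bytes per transfer x number of channels. */
static double dram_peak_gbps(double gigatransfers, double bytes_per_transfer,
                             int channels) {
    return gigatransfers * bytes_per_transfer * channels;
}

/* Latency-bound bandwidth in GB/s when `outstanding` 64-byte cache-line
   requests can be in flight at once and each takes `latency_ns`. */
static double concurrency_gbps(int outstanding, double latency_ns) {
    assert(latency_ns > 0.0);
    return outstanding * 64.0 / latency_ns;   /* bytes per ns == GB/s */
}
```

dram_peak_gbps(2.133, 8, 2) gives about 34.1 GB/s and concurrency_gbps(1, 50) gives 1.28 GB/s, matching the figures above. With 10 requests in flight at the same assumed 50 ns, the bound is 12.8 GB/s, still well short of the DRAM peak.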
Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4 channel system). Only some recent gen 2-channel "client" cores, which seem to have a better uncore and perhaps more line buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be one of them.
Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth, while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8 core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.
Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help and they can even hurt since they may increase the latency since the handoff time for the line buffer may be longer than a scenario where prefetch brings the RFO line into LLC (or even L2) and then the store completes in LLC for an effective lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.
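For readers who have not seen one, here is a minimal sketch of what a non-temporal copy loop looks like with SSE2 intrinsics. It is not taken from any of the memcpy implementations measured above. The name nt_copy is hypothetical, the sketch assumes a 16-byte aligned destination and a length that is a multiple of 16, and the memcpy fallback for non-SSE2 targets is an assumption added so it compiles elsewhere. An ordinary copy moves three lines of traffic per line copied (read source, read-for-ownership of destination, write destination), while the streaming stores here avoid the middle one, which is the 1/3 referred to above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__SSE2__)
#include <emmintrin.h>
#define NT_COPY_HAVE_SSE2 1
#endif

/* Copy n bytes using non-temporal (streaming) 16-byte stores.
   Assumes dst is 16-byte aligned and n is a multiple of 16. */
static void nt_copy(void *dst, const void *src, size_t n) {
    assert(((uintptr_t)dst % 16) == 0 && n % 16 == 0);
#ifdef NT_COPY_HAVE_SSE2
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary load, any alignment */
        _mm_stream_si128(d + i, v);          /* store that bypasses the cache */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    memcpy(dst, src, n);                     /* assumption: portable fallback */
#endif
}
```

A production routine would also handle unaligned heads and tails and would only switch to streaming stores above some size threshold; both are omitted here.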
So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins there (if it gets the best of both worlds).
Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...
References
Other good sources of info not integrated in the above.
comp.arch investigation of
rep movsb
versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last read/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).1 Presumably the intention is to restrict it to cases where, for example, code-size is very important.
2 See Section 3.7.5: REP Prefix and Data Movement.
3 It is key to note this applies only for the various stores within the single instruction itself: once complete, the block of stores still appear ordered with respect to prior and subsequent stores. So code can see stores from the
rep movs
out of order with respect to each other but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.
4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice
rep movs
has even more freedom since there are still some ordering constraints on WC/NT stores.
5 This was common in the latter part of the 32-bit era, where many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit
double
type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably
rep movs
microcode can still use 256b loads/stores.
6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.
7 I'm making the assumption that "standard"
memcpy
is choosing a non-temporal approach, which is highly likely for this size of buffer.
8 That isn't necessarily obvious, since it could be the case that the uop stream that is generated by the
rep movsb
simply monopolizes dispatch and then it would look very much like the explicit
mov
case. It seems that it doesn't work like that, however: uops from subsequent instructions can mingle with uops from the microcoded
rep movsb
.
9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which
memcpy
would be a poster child (as opposed to purely latency-bound loads such as pointer chasing).
Enhanced REP MOVSB (Ivy Bridge and later)
The Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (we still need to check the corresponding CPUID bit), which allows us to copy memory fast.
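Checking the bit can be sketched as follows. ERMSB is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9. The helper name cpu_has_erms is my own (hypothetical), and I am assuming GCC or Clang, whose cpuid.h provides __get_cpuid_count; on other targets the sketch simply reports false.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang: __get_cpuid_count */
#endif

/* Hypothetical helper: true if CPUID.(EAX=07H, ECX=0):EBX bit 9 (ERMS) is set. */
static bool cpu_has_erms(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;  /* leaf 7 is not supported on this CPU */
    return ((ebx >> 9) & 1u) != 0;
#else
    return false;      /* not an x86 target */
#endif
}
```

A copy routine would typically call this once at startup and cache the result, rather than executing CPUID on every copy.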
The cheapest versions of later processors, such as the Kaby Lake Celeron and Pentium released in 2017, don't have AVX, which could otherwise have been used for fast memory copy, but they still have Enhanced REP MOVSB.
REP MOVSB (ERMSB) is only faster than an AVX copy or a general-purpose register copy if the block size is at least 256 bytes. For blocks below 64 bytes it is MUCH slower, because ERMSB has a high internal startup cost of about 35 cycles.
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
As I said earlier, REP MOVSB begins to outperform other methods when the length is at least 256 bytes, but to see a clear benefit over AVX copy, the length has to be more than 2048 bytes.
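One way to act on these thresholds is to dispatch on the length. The sketch below is only an illustration: copy_dispatch and REP_MOVSB_MIN are names I made up, the 256-byte cutoff is just the figure quoted above and should be tuned by measurement, and the small-block path simply defers to the library memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff taken from the 256-byte figure above; tune by measurement. */
#define REP_MOVSB_MIN 256

/* Hypothetical helper: use rep movsb only for blocks large enough to
   amortize its startup cost, otherwise fall back to memcpy. */
static void *copy_dispatch(void *dst, const void *src, size_t n) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    if (n >= REP_MOVSB_MIN) {
        void *d = dst;
        /* "+D", "+S", "+c": the destination, source and count registers
           are both read and modified by rep movsb. */
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return dst;  /* return the original destination, like memcpy */
    }
#endif
    return memcpy(dst, src, n);
}
```

In a real implementation this would be combined with the CPUID check for ERMSB, since the thresholds above only apply to processors that have it.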
On the effect of alignment on REP MOVSB vs. AVX copy, the Intel Manual gives the following information:
I ran tests on an Intel Core i5-6600 in 64-bit mode, comparing a REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits in the L1 cache:
REP MOVSB memcpy():
MOV RAX... memcpy():
So, even on 128-bit blocks, REP MOVSB is slower than just a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
Normal (not enhanced) REP MOVS on Nehalem and later
Surprisingly, previous architectures (Nehalem and later) that didn't yet have Enhanced REP MOVSB had a quite fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementation for blocks that are large, but not so large that they exceed the L1 cache.
The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information about the Nehalem microarchitecture, i.e. the Intel Core i5, i7 and Xeon processors released in 2009 and 2010.
REP MOVSB
The latency for MOVSB, is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 have a 50-cycle startup cost.
My conclusion: REP MOVSB is almost useless on Nehalem.
MOVSW/MOVSD/MOVSQ
Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):
Intel does not seem to be correct here. From the above quote we understand that for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ, but tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.
According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.
Conclusion: if you just need to copy data that fits L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!
REP MOVSD/MOVSQ is the universal solution that works very well on all Intel processors (no ERMSB required) if the data fits L1 cache
Here are the tests of REP MOVS* when the source and destination were in the L1 cache, with blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/
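For reference, a REP MOVSQ copy can be sketched with GCC-style inline assembly as below. This is only an illustration under my own assumptions: copy_movsq is a hypothetical name, the sketch targets x86-64 only, and the up to 7 leftover bytes are handled with a plain memcpy rather than anything tuned.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes as 8-byte units with rep movsq,
   then copy the remaining 0..7 bytes separately. */
static void *copy_movsq(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    void *d = dst;
    size_t qwords = n / 8;  /* number of 8-byte units */
    size_t tail = n % 8;    /* leftover bytes */
    __asm__ volatile("rep movsq"
                     : "+D"(d), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    /* d and src now point just past the qwords that were copied. */
    memcpy(d, src, tail);
    return dst;
#else
    return memcpy(dst, src, n);  /* fallback on non-x86-64 targets */
#endif
}
```

Because the count register holds the number of qwords rather than bytes, a length below 8 makes the rep movsq a no-op and only the tail copy runs.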
Yonah (2006-2008)
Nehalem (2009-2010)
Westmere (2010-2011)
Ivy Bridge (2012-2013) - with Enhanced REP MOVSB
SkyLake (2015-2016) - with Enhanced REP MOVSB
Kaby Lake (2016-2017) - with Enhanced REP MOVSB
As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is fastest, albeit just slightly faster than REP MOVSD/MOVSQ, but there is no doubt that on all processors since Nehalem, REP MOVSD/MOVSQ works very well. You don't even need "Enhanced REP MOVSB", since on Ivy Bridge (2013) with Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock figure as on Nehalem (2010) without Enhanced REP MOVSB, while in fact REP MOVSB became very fast only with SkyLake (2015), where it is twice as fast as on Ivy Bridge. So this Enhanced REP MOVSB bit in the CPUID may be confusing - it only shows that
REP MOVSB
per se is OK, but not that any
REP MOVS*
is faster.
The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO). But this protocol is no longer used on Ivy Bridge, which has ERMSB, according to Andy Glew's comments on an answer to "why are complicated memcpy/memset superior?", quoted in a Peter Cordes answer. Those comments also explain why the startup costs are so high for REP MOVS*: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that the Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol; they did not violate memory ordering, unlike ERMSB in Ivy Bridge.
Disclaimer