I'm trying to improve the performance of a copy operation via SSE and AVX:
#include <immintrin.h>
#include <algorithm>
#include <chrono>
#include <iostream>

int main()
{
    const int sz = 1024;
    // 32-byte alignment is required by the aligned AVX load/store used below
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 32);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 32);
    float a = 0;
    std::generate(mas, mas+sz, [&](){ return ++a; });
    const int nn = 1000; // number of iterations in the tester loops
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;

    // std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();

    // SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
            __m128 buffer = _mm_load_ps(_mas);
            _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();

    // AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
            __m256 buffer = _mm256_load_ps(_mas);
            _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();

    std::cout << "serial - " << elapsed1 << ", SSE - " << elapsed2 << ", AVX - " << elapsed3
              << "\nSSE gain: " << elapsed1/elapsed2
              << "\nAVX gain: " << elapsed1/elapsed3;

    _mm_free(mas);
    _mm_free(tar);
    return 0;
}
It works. However, as the number of iterations in the tester loops (nn) increases, the performance gain of the SIMD copy decreases:
nn=10: SSE-gain=3, AVX-gain=6;
nn=100: SSE-gain=0.75, AVX-gain=1.5;
nn=1000: SSE-gain=0.55, AVX-gain=1.1;
Can anybody explain the reason for this performance decrease, and is it advisable to vectorize the copy operation manually?
This is a very interesting question, but I believe none of the answers so far is correct, because the question itself is so misleading.
The title should be changed to "How does one reach the theoretical memory I/O bandwidth?"
No matter what instruction set is used, the CPU is so much faster than RAM that a pure block memory copy is 100% I/O bound. And this explains why there is little difference between SSE and AVX performance.
For small buffers hot in L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell where 256b loads/stores really do use a 256b data path to L1D cache instead of splitting into two 128b operations.
Ironically, the ancient x86 string instruction rep movsq performs much better than SSE and AVX for memory copies!
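For example, here is a minimal sketch of invoking the string-move instruction from C++ via GCC/Clang inline assembly (the wrapper name is mine):

#include <cstddef>

// Copy n bytes with the legacy string-move instruction; on CPUs with
// ERMSB (Ivy Bridge and later) this is often competitive with, or
// faster than, hand-written SSE/AVX copy loops for large blocks.
static void rep_movsb_copy(void *dst, const void *src, std::size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}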
The article here explains really well how to saturate memory bandwidth, and it has rich references to explore further.
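One technique discussed in that literature is non-temporal (streaming) stores, which skip the read-for-ownership traffic on the destination lines. A hedged sketch (function name mine; assumes 32-byte-aligned pointers and n divisible by 8):

#include <immintrin.h>
#include <cstddef>

// Streaming AVX copy: NT stores bypass the cache and avoid reading the
// destination lines first, roughly halving bus traffic vs. a plain copy.
static void stream_copy(float *dst, const float *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, _mm256_load_ps(src + i));
    _mm_sfence();  // make the NT stores globally visible before returning
}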
See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and the non-RFO stores done by rep stosb/stosq) vs. regular stores, and how single-core memory bandwidth is often limited by maximum concurrency / latency (roughly, per-core bandwidth ≈ outstanding cache-line fills × 64 B ÷ memory latency), not by the memory controller itself.

I think this is because the measurement is not accurate for such short operations.
When measuring performance on an Intel CPU:
Disable "Turbo Boost" and "SpeedStep". You can do this in the system BIOS.
Change the process/thread priority to High or Realtime. This will keep your thread running.
Set the process CPU affinity mask to a single core. CPU masking together with higher priority will minimize context switching.
Use the __rdtsc() intrinsic. Intel Core series CPUs return the internal clock counter from __rdtsc(); you will get 3,400,000,000 counts/second from a 3.4 GHz CPU. Note that __rdtsc() is not itself a serializing instruction, so for very short regions pair it with a fence (e.g. lfence) if you need precise measurement boundaries.
This is my test-bed startup code for testing SSE/AVX code.
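(The original listing is not reproduced here; the following is a minimal Windows sketch along the lines of the tips above, with structure of my own choosing.)

#include <windows.h>
#include <intrin.h>
#include <cstdio>

int main()
{
    // Raise priority and pin to one core to minimize context switches,
    // per the tips above.
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    SetThreadAffinityMask(GetCurrentThread(), 1);  // run on core 0 only

    // Time a region with the TSC.
    unsigned long long t0 = __rdtsc();
    // ... code under test goes here ...
    unsigned long long t1 = __rdtsc();

    printf("elapsed: %llu cycles\n", t1 - t0);
    return 0;
}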
I think that your main problem/bottleneck is your _mm_malloc. I highly suggest using std::vector as your main data structure if you are concerned about locality in C++. Intrinsics are not exactly a "library"; they are more like builtin functions provided by your compiler, so you should be familiar with your compiler's internals/docs before using them.
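For illustration, a minimal sketch of the std::vector approach (function name mine; note that vector's default allocator doesn't guarantee 32-byte alignment, hence the unaligned intrinsics):

#include <immintrin.h>
#include <cstddef>
#include <vector>

// Copy with AVX over std::vector storage. Unaligned loads/stores cost
// the same as aligned ones on modern CPUs when the data happens to be
// aligned anyway. Tail elements (size not a multiple of 8) are omitted
// here for brevity.
void avx_copy(std::vector<float> &dst, const std::vector<float> &src)
{
    for (std::size_t i = 0; i + 8 <= src.size(); i += 8)
        _mm256_storeu_ps(&dst[i], _mm256_loadu_ps(&src[i]));
}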
Also note that the fact that AVX is newer than SSE doesn't make AVX faster; whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs. SSE" argument, for example see this answer. Try with a POD int array[] or an std::vector.

The problem is that your test does a poor job of mitigating some factors in the hardware that make benchmarking hard. To test this, I've made my own test case. Something like this:
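(The original listing wasn't preserved here; below is a hedged reconstruction of such a test, with helper names of my own choosing.)

#include <immintrin.h>
#include <algorithm>
#include <chrono>
#include <iostream>

// Time 1000 invocations of a copy routine, in microseconds.
// Whichever test runs first pays the CPU's frequency ramp-up cost.
template <class F>
long long time_us(F f)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) f();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    const int sz = 1024;
    float *src = (float *)_mm_malloc(sz * sizeof(float), 32);
    float *dst = (float *)_mm_malloc(sz * sizeof(float), 32);
    std::fill(src, src + sz, 1.0f);

    auto avx_copy = [&] {
        for (int i = 0; i < sz; i += 8)
            _mm256_store_ps(dst + i, _mm256_load_ps(src + i));
    };
    auto std_copy = [&] { std::copy(src, src + sz, dst); };

    // Swap these two lines and the "winner" changes with them.
    std::cout << "std::copy - " << time_us(std_copy) << " us\n";
    std::cout << "AVX       - " << time_us(avx_copy) << " us\n";

    _mm_free(src);
    _mm_free(dst);
}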
The output showed AVX a bunch faster than std::copy. But what happens when I change the test case to run the two copies in the opposite order? Absolutely nothing changed except the order of the tests, yet the results flip.
Whoa, how is that possible? The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has three answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.
This is one of the reasons why benchmarking is hard and why you should never trust anyone's micro-benchmarks unless they include detailed information about their setup. It isn't just the code that can go wrong; power-saving features and weird drivers can completely mess up your benchmark. Once I measured a factor-of-7 difference in performance by toggling a switch in the BIOS that fewer than 1% of notebooks offer.
Writing fast SSE is not as simple as using SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair and your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).
You can test this idea by manually unrolling one notch:
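(The answer's original listing isn't preserved here; this is a sketch of that one-notch unroll, assuming sz is a multiple of 8.)

#include <immintrin.h>

// One notch of unrolling: two independent load/store pairs per iteration,
// so the second load does not serialize behind the use of the first.
void sse_copy_unrolled(float *tar, const float *mas, int sz)
{
    for (int i = 0; i < sz; i += 8)
    {
        __m128 b0 = _mm_load_ps(mas + i);
        __m128 b1 = _mm_load_ps(mas + i + 4);
        _mm_store_ps(tar + i,     b0);
        _mm_store_ps(tar + i + 4, b1);
    }
}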
Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops, the right tool is the Intel Architecture Code Analyzer (IACA), a static analysis tool that can tell you things like "you have pipeline stalls".
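If you want to try IACA, here is a hedged sketch of marking the loop for analysis (IACA_START/IACA_END come from the iacaMarks.h header that ships with the tool):

#include <immintrin.h>
#include "iacaMarks.h"  // ships with IACA; defines IACA_START / IACA_END

// The markers emit magic byte sequences that delimit the region IACA's
// static analyzer reports on; compile normally, then run iaca on the binary.
void copy_loop(float *tar, const float *mas, int sz)
{
    for (int i = 0; i < sz; i += 4)
    {
        IACA_START                       // first marker: analysis begins here
        __m128 b = _mm_load_ps(mas + i);
        _mm_store_ps(tar + i, b);
    }
    IACA_END                             // second marker: analysis ends here
}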