How to calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2?
The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats.
The output is an array of floats.
Both arrays have the correct memory alignment for SSE.
The code below only uses 1 xmm register, runs on Linux, and can be compiled with gcc -O3 code.cpp -lrt -msse4.2
Thank you.
#include <iostream>
#include <cstdlib>      // aligned_alloc, free
#include <emmintrin.h>
#include <time.h>
using namespace std;

void print_xmm(__m128 xmm){
    float out[4];
    _mm_storeu_ps(out, xmm);
    int i;
    for (i = 0; i < 4; ++i) std::cout << out[i] << " ";
    std::cout << std::endl;
}

void print_arr(float* ptr, size_t size){
    size_t i;
    for(i = 0; i < size; ++i){
        cout << ptr[i] << " ";
    }
    cout << endl;
}

int main(void){
    size_t size = 25000 * 4;
    // this has to be a multiple of 4
    size_t repeat = 10000;
    // test 10000 cycles of the code
    float* ar_in  = (float*)aligned_alloc(16, size*sizeof(float));
    float* ar_out = (float*)aligned_alloc(16, size*sizeof(float));
    //fill test data into the input array
    //the data is an array of positive numbers.
    size_t i;
    for (i = 0; i < size; ++i){
        ar_in[i] = (i+1) * (i+1);
    }
    //prepare for reciprocal square root.
    __m128 xmm0;
    size_t size_fix = size*sizeof(float)/sizeof(__m128);
    float* ar_in_end = ar_in + size_fix;
    float* ar_out_now;
    float* ar_in_now;
    //timing
    struct timespec tp_start, tp_end;
    i = repeat;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    //start timing
    while(--i){
        ar_out_now = ar_out;
        for(ar_in_now = ar_in;
            ar_in_now != ar_in_end;
            ar_in_now += 4, ar_out_now += 4){
            //4 = sizeof(__m128)/sizeof(float);
            xmm0 = _mm_load_ps(ar_in_now);
            //cout << "load xmm: ";
            //print_xmm(xmm0);
            xmm0 = _mm_rsqrt_ps(xmm0);
            //cout << "rsqrt xmm: ";
            //print_xmm(xmm0);
            _mm_store_ps(ar_out_now, xmm0);
        }
    }
    //end timing
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;
    timing = ((double)(tp_end.tv_sec - tp_start.tv_sec)
             + (tp_end.tv_nsec - tp_start.tv_nsec) * nano)/repeat;
    cout << " timing per cycle: " << timing << endl;
    /*
    cout << "input array: ";
    print_arr(ar_in, size);
    cout << "output array: ";
    print_arr(ar_out, size);
    */
    //free mem
    free(ar_in);
    free(ar_out);
    return 0;
}
How big is your array of floats? If it's already hot in L1 (or maybe L2), gcc 5.3's output for this code bottlenecks on uop throughput on modern Intel CPUs, since it makes a loop with 6 fused-domain uops that does one vector per iteration. (So it will run at one vector per 2 cycles.)
To achieve 1 vector per clock throughput on modern Intel CPUs, you need the loop to be unrolled (see below for why non-unrolled asm can't work). Having the compiler do it for you is probably good (instead of doing it by hand in the C++ source), e.g. use profile-guided optimization (gcc -fprofile-use), or just blindly use -funroll-loops.
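For example, a plausible build sequence (using g++ here so the C++ standard library gets linked; the other flags are the ones from the question) would be to build instrumented, do one training run, then rebuild with the collected profile:

g++ -O3 -msse4.2 -fprofile-generate code.cpp -lrt -o rsqrt
./rsqrt
g++ -O3 -msse4.2 -fprofile-use code.cpp -lrt -o rsqrt

Or, skipping the profiling step entirely:

g++ -O3 -msse4.2 -funroll-loops code.cpp -lrt -o rsqrt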
16 bytes per clock is enough in theory to saturate main memory bandwidth with one core. However, IIRC Z Boson has observed better bandwidth from using multiple cores, probably because multiple cores keep more requests outstanding, and a stall on one core doesn't leave memory idle. If the input is hot in L2 on a core, though, it's probably best to process the data using that core.
On Haswell or later, 16 bytes loaded and stored per clock is only half of L1 cache bandwidth, so you need an AVX version of this to max per-core bandwidth.
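For illustration, a rough AVX sketch of the same inner loop (my assumptions, not from the question: 32-byte aligned arrays, an element count that is a multiple of 8, and building with -mavx):

#include <immintrin.h>
#include <cstddef>

// Process 8 floats (32 bytes) per iteration instead of 4.
void rsqrt_avx(const float* in, float* out, std::size_t n){
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(in + i);   // 32-byte aligned load
        v = _mm256_rsqrt_ps(v);              // approximate 1/sqrt of 8 lanes
        _mm256_store_ps(out + i, v);         // 32-byte aligned store
    }
}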
If you bottleneck on memory, you might want to do a Newton-Raphson iteration to get a nearly-full accuracy 1/sqrt(x), especially if you use multiple threads for a big array. (Because then it's ok if a single thread can't sustain one load+store per clock.)
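rsqrtps gives roughly 12 bits of precision; one Newton-Raphson step brings that close to full single precision. A minimal sketch of such a refinement, using the standard y1 = y0*(1.5 - 0.5*x*y0*y0) formula (not taken from the question's code):

#include <xmmintrin.h>

// rsqrtps estimate plus one Newton-Raphson step.
// y1 = y0 * (1.5 - 0.5*x*y0*y0), written here as 0.5 * y0 * (3 - x*y0*y0).
static inline __m128 rsqrt_nr_ps(__m128 x){
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);                    // ~12-bit estimate
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);    // x * y0 * y0
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}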
Or maybe just use rsqrt on the fly when loading this data later. It's very cheap, with high throughput, but still latency similar to an FP add. Again, if it's a big array that doesn't fit in cache, increasing computational intensity by doing fewer separate passes over the data is a big deal. (Cache blocking, aka loop tiling, would also be a good idea if you can do so: run multiple steps of your algorithm on a cache-sized chunk of your data.)
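To make the tiling idea concrete, here is a hypothetical sketch: next_pass() is a placeholder for whatever later stage would otherwise re-read the whole array, the chunk size is a tuning assumption, and n is assumed to be a multiple of 4 with 16-byte aligned arrays.

#include <emmintrin.h>
#include <cstddef>

void next_pass(float* data, std::size_t count);    // hypothetical later stage that consumes the results

// Loop tiling: run the rsqrt pass and the next stage on one cache-sized chunk
// at a time, so the chunk is still hot in cache when the next stage reads it.
void process_tiled(const float* in, float* out, std::size_t n){
    const std::size_t CHUNK = 16 * 1024;           // 16K floats = 64 KiB per array (assumed to fit in L2)
    for (std::size_t base = 0; base < n; base += CHUNK) {
        std::size_t end = (base + CHUNK < n) ? base + CHUNK : n;
        for (std::size_t i = base; i < end; i += 4)                // pass 1: approximate rsqrt
            _mm_store_ps(out + i, _mm_rsqrt_ps(_mm_load_ps(in + i)));
        next_pass(out + base, end - base);                         // pass 2, while the chunk is hot
    }
}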
Only use cache-bypassing NT stores as a last resort if you can't find a way to make effective use of cache. It's much better if you can convert some data that you're about to use, so it's in cache when it's next used.
The main loop (from .L31 to jne .L31 on the Godbolt compiler explorer) is 6 uops for Intel SnB-family CPUs, because indexed addressing modes don't micro-fuse. (This isn't yet documented in Agner Fog's microarch pdf, unfortunately.)
It's 4 fused-domain uops on Nehalem, with only three ALU uops, so Nehalem should run this at 1 per clock.
.L31: # the main loop: 6 uops on SnB-family, 4 uops on Nehalem
rsqrtps xmm0, XMMWORD PTR [rbx+rax] # tmp127, MEM[base: ar_in_now_10, index: ivtmp.51_61, offset: 0B]
movaps XMMWORD PTR [rbp+0+rax], xmm0 # MEM[base: ar_out_now_12, index: ivtmp.51_61, offset: 0B], tmp127
add rax, 16 # ivtmp.51,
cmp rax, 100000 # ivtmp.51,
jne .L31 #,
Since you want to write a separate destination, there's no way to get the loop down to 4 fused-domain uops so it can run at one vector per clock without unrolling. (Both the load and the store need to be one-register addressing modes, so the trick of using src - dst indexed by current_dst instead of incrementing src doesn't work.)
Modifying your C++ so gcc would use a pointer increment would only save one uop, because you have to increment both src and dst. That is, float *endp = start + length; and for (p = start ; p < endp ; p += 4) {} would make a loop like
.loop:
rsqrtps xmm0, [rsi]
add rsi, 16
movaps [rdi], xmm0
add rdi, 16
cmp rdi, rbx
jne .loop
Hopefully gcc will do something like this while unrolling; otherwise the rsqrtps + movaps will be 4 fused-domain uops on their own if they still use an indexed addressing mode, and no amount of unrolling will make your loop run at one vector per clock.
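If you do end up unrolling by hand in the source, a minimal sketch (my choices, not the question's: unroll by 2, pointer increments only, element count a multiple of 8) could look like this; whether gcc actually keeps the one-register addressing modes is still up to the compiler:

#include <emmintrin.h>
#include <cstddef>

// Unrolled by 2: one increment per pointer per iteration, so the loads and
// stores can use one-register + displacement addressing modes.
void rsqrt_unrolled(const float* in, float* out, std::size_t n){
    const float* end = in + n;                 // n assumed to be a multiple of 8
    while (in != end) {
        __m128 a = _mm_load_ps(in);
        __m128 b = _mm_load_ps(in + 4);
        _mm_store_ps(out,     _mm_rsqrt_ps(a));
        _mm_store_ps(out + 4, _mm_rsqrt_ps(b));
        in  += 8;
        out += 8;
    }
}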
Because this is a streaming calculation with very low arithmetic intensity, you are almost certainly memory bound. You will likely find a measurable speedup if you use non-temporal loads and stores.
// Hint to the CPU that we don't want this data to pollute the caches
// (prefetch one vector ahead of the current load; be sure to update
//  the offset if you use manual loop unrolling)
_mm_prefetch(reinterpret_cast<const char*>(ar_in_now + 4), _MM_HINT_NTA);
// Hint to the CPU that we don't need to write back through the cache
_mm_stream_ps(ar_out_now, xmm0);
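Plugged into the question's inner loop, that might look like the sketch below; the _mm_sfence() at the end is my addition, since streaming stores are weakly ordered and should be fenced before another thread reads the output:

ar_out_now = ar_out;
for (ar_in_now = ar_in; ar_in_now != ar_in_end; ar_in_now += 4, ar_out_now += 4) {
    _mm_prefetch(reinterpret_cast<const char*>(ar_in_now + 4), _MM_HINT_NTA);
    xmm0 = _mm_rsqrt_ps(_mm_load_ps(ar_in_now));
    _mm_stream_ps(ar_out_now, xmm0);   // non-temporal store, bypasses the caches
}
_mm_sfence();                          // make the streaming stores globally visible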
EDIT:
I ran some tests to see what things look like on different hardware. Unsurprisingly, the results are quite sensitive to the hardware in use. I would posit that the differences are likely due to the increased number of read/write buffers in newer CPUs.
All code was compiled using gcc-6.1 with
gcc -std=c++14 -O3 -march=native -mavx -mfpmath=sse -ffast-math
Intel Core i3-3120M @ 2.5GHz; 3MB cache
OP's code: 350 +- 10 milliseconds
NTA Prefetch: 360 +- 5 milliseconds
NTA Prefetch+NTA store: 430 +- 10 milliseconds
Intel Core i7-6500U CPU @ 2.50GHz; 4MB cache
OP's code: 205 +- 5 milliseconds
NTA Prefetch: 220 +- 2 milliseconds
NTA Prefetch+NTA store: 200 +- 5 milliseconds
When the problem size is increased to 2MB, NTA Prefetch+NTA store sees a ~30% decrease in runtime compared to the OP's solution.
Results: The problem size is too small to substantially benefit from NTA. On older architectures, it is detrimental. On newer architectures, it is on par with the OP's solution.
Conclusion: Probably not worth the extra effort in this case.
You have to measure it, of course, but there is well-known code to compute a (not very precise) inverse square root; see https://www.beyond3d.com/content/articles/8/
float InvSqrt (float x) {
    float xhalf = 0.5f * x;
    int i = *(int*)&x;              // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);      // magic-constant initial guess
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);       // one Newton-Raphson step
    return x;
}
A conversion to SSE2, tested with VS2015 and GCC 5.4.0:
__m128 InvSqrtSSE2(__m128 x) {
    __m128 xhalf = _mm_mul_ps(x, _mm_set1_ps(0.5f));
    // magic-constant initial guess, applied to the raw bits of all four lanes
    x = _mm_castsi128_ps(_mm_sub_epi32(_mm_set1_epi32(0x5f3759df), _mm_srai_epi32(_mm_castps_si128(x), 1)));
    // one Newton-Raphson step: x * (1.5 - xhalf * x * x)
    return _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(xhalf, _mm_mul_ps(x, x))));
}
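For comparison with the question's loop, dropping it in would just be this (a sketch reusing the question's pointers):

ar_out_now = ar_out;
for (ar_in_now = ar_in; ar_in_now != ar_in_end; ar_in_now += 4, ar_out_now += 4) {
    _mm_store_ps(ar_out_now, InvSqrtSSE2(_mm_load_ps(ar_in_now)));
}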
UPDATE I
Mea culpa! Thanks to @EOF, who noticed the improper conversion; those conversions have been replaced with casts.