Faster approximate reciprocal square root of an array

Published 2019-06-04 06:58

Question:

How can I calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2?

The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats.

The output is an array of floats.

Both arrays are correctly aligned for SSE.

The code below uses only one xmm register, runs on Linux, and can be compiled with gcc -O3 code.cpp -lrt -msse4.2

Thank you.

#include <iostream>
#include <cstdlib>      // aligned_alloc, free
#include <emmintrin.h>  // SSE intrinsics (pulls in xmmintrin.h for _mm_rsqrt_ps)
#include <time.h>

using namespace std;
void print_xmm(__m128 xmm){
    float out[4];
    _mm_storeu_ps(out,xmm);
    int i;
    for (i = 0; i < 4; ++i) std::cout << out[i] << " ";
    std::cout << std::endl;
}

void print_arr(float* ptr, size_t size){
    size_t i;
    for(i = 0; i < size; ++i){
        cout << ptr[i] << " ";
    }
    cout << endl;
}

int main(void){
    size_t size = 25000 * 4;
        // this has to be multiple of 4
    size_t repeat = 10000;
        // test 10000 cycles of the code
    float* ar_in = (float*)aligned_alloc(16, size*sizeof(float));
    float* ar_out = (float*)aligned_alloc(16, size*sizeof(float));
    //fill test data into the input array
    //the data is an array of positive numbers.
    size_t i;
    for (i = 0; i < size; ++i){
        ar_in[i] = (i+1) * (i+1);
    }
    //prepare for reciprocal square root.
    __m128 xmm0;
    float* ar_in_end  = ar_in + size;
        // one past the end of the whole input (size is a multiple of 4)
    float* ar_out_now;
    float* ar_in_now;
    //timing
    struct timespec tp_start, tp_end;
    i = repeat;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    //start timing
    while(i--){ // run the timed block exactly 'repeat' times
        ar_out_now = ar_out;
        for(ar_in_now = ar_in;
            ar_in_now != ar_in_end;
            ar_in_now += 4, ar_out_now+=4){
            //4 = sizeof(__m128)/sizeof(float);
            xmm0 = _mm_load_ps(ar_in_now);
            //cout << "load xmm: ";
            //print_xmm(xmm0);
            xmm0 = _mm_rsqrt_ps(xmm0);
            //cout << "rsqrt xmm: ";
            //print_xmm(xmm0);
            _mm_store_ps(ar_out_now,xmm0);
        }
    }
    //end timing
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;

    timing = ((double)(tp_end.tv_sec  - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano)/repeat;

    cout << "    timing per cycle: " << timing << endl;
    /* 
    cout << "input array: ";
    print_arr(ar_in, size);
    cout << "output array: ";
    print_arr(ar_out,size);
    */
    //free mem
    free(ar_in);
    free(ar_out);
    return 0;
}

Answer 1:

How big is your array of floats? If it's already hot in L1 (or maybe L2), gcc5.3 output for this code bottlenecks on uop throughput on modern Intel CPUs, since it makes a loop with 6 fused-domain uops that does one vector per iteration. (So it will run at one vector per 2 cycles).

To achieve 1 vector per clock throughput on modern Intel CPUs, you need the loop to be unrolled (see below for why non-unrolled asm can't work). Having the compiler do it for you is probably good (instead of doing it by hand in the C++ source). e.g. use profile-guided optimization (gcc -fprofile-use), or just blindly use -funroll-loops.
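For illustration, here is a minimal sketch of what a 4x manual unroll could look like (assuming 16-byte aligned pointers and a count that is a multiple of 16 floats; the function name and signature are just for this example). Letting the compiler unroll the original loop should give something equivalent:

#include <xmmintrin.h>  // _mm_load_ps, _mm_rsqrt_ps, _mm_store_ps (SSE1)
#include <cstddef>      // size_t

// Sketch only: process 4 vectors (16 floats) per iteration.
void rsqrt_unrolled4(const float* in, float* out, size_t n){
    for (size_t i = 0; i < n; i += 16){
        _mm_store_ps(out + i,      _mm_rsqrt_ps(_mm_load_ps(in + i)));
        _mm_store_ps(out + i + 4,  _mm_rsqrt_ps(_mm_load_ps(in + i + 4)));
        _mm_store_ps(out + i + 8,  _mm_rsqrt_ps(_mm_load_ps(in + i + 8)));
        _mm_store_ps(out + i + 12, _mm_rsqrt_ps(_mm_load_ps(in + i + 12)));
    }
}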


16 bytes per clock is enough in theory to saturate main memory bandwidth with one core. However, IIRC Z Boson has observed better bandwidth from using multiple cores, probably because multiple cores keep more requests outstanding, and a stall on one core doesn't leave memory idle. If the input is hot in L2 on a core, though, it's probably best to process the data using that core.

On Haswell or later, 16 bytes loaded and stored per clock is only half of L1 cache bandwidth, so you need an AVX version of this to max per-core bandwidth.
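A sketch of the AVX variant (needs -mavx, 32-byte aligned buffers, and a count that is a multiple of 8 floats; names are illustrative):

#include <immintrin.h>  // _mm256_load_ps, _mm256_rsqrt_ps, _mm256_store_ps (AVX)
#include <cstddef>

// Sketch only: 8 floats per vector instead of 4.
void rsqrt_avx(const float* in, float* out, size_t n){
    for (size_t i = 0; i < n; i += 8){
        _mm256_store_ps(out + i, _mm256_rsqrt_ps(_mm256_load_ps(in + i)));
    }
}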

If you bottleneck on memory, you might want to do a Newton-Raphson iteration to get a nearly-full accuracy 1/sqrt(x), especially if you use multiple threads for a big array. (Because then it's ok if a single thread can't sustain one load+store per clock.)
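A minimal sketch of one Newton-Raphson step on top of rsqrtps (this roughly doubles the precision of the ~12-bit approximation; the exact scheduling and any FMA use are left to the compiler):

#include <xmmintrin.h>

// y1 = 0.5 * y0 * (3 - x * y0 * y0), where y0 = rsqrtps(x)
static inline __m128 rsqrt_nr(__m128 x){
    __m128 y     = _mm_rsqrt_ps(x);                   // ~12-bit approximation
    __m128 xyy   = _mm_mul_ps(_mm_mul_ps(x, y), y);   // x * y0 * y0
    __m128 halfy = _mm_mul_ps(_mm_set1_ps(0.5f), y);  // 0.5 * y0
    return _mm_mul_ps(halfy, _mm_sub_ps(_mm_set1_ps(3.0f), xyy));
}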

Or maybe just use rsqrt on the fly when loading this data later. It's very cheap, with high throughput, but its latency is still similar to an FP add. Again, if it's a big array that doesn't fit in cache, increasing computational intensity by doing fewer separate passes over the data is a big deal. (Cache Blocking, aka Loop Tiling, would also be a good idea if you can do so: run multiple steps of your algorithm on a cache-sized chunk of your data.)
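For instance, if the next pass over the data were a hypothetical reduction over 1/sqrt(x), folding rsqrtps into that pass avoids writing (and later re-reading) a whole intermediate array; the sum below is only an illustration:

#include <xmmintrin.h>
#include <cstddef>

// Hypothetical consumer: accumulate 1/sqrt(x) on the fly instead of
// materializing an output array.  Assumes 16-byte alignment, n a multiple of 4.
float sum_rsqrt(const float* in, size_t n){
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_rsqrt_ps(_mm_load_ps(in + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}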

Only use cache-bypassing NT stores as a last resort if you can't find a way to make effective use of cache. It's much better if you can convert some data that you're about to use, so it's in cache when it's next used.


The main loop (from .L31 to jne .L31 on the Godbolt compiler explorer) is 6 uops for Intel SnB-family CPUs, because indexed addressing modes don't micro-fuse. (This isn't yet documented in Agner Fog's microarch pdf, unfortunately.)

It's 4 fused-domain uops on Nehalem, with only three ALU uops, so Nehalem should run this at 1 per clock.

.L31:    # the main loop: 6 uops on SnB-family, 4 uops on Nehalem
    rsqrtps xmm0, XMMWORD PTR [rbx+rax]       # tmp127, MEM[base: ar_in_now_10, index: ivtmp.51_61, offset: 0B]
    movaps  XMMWORD PTR [rbp+0+rax], xmm0       # MEM[base: ar_out_now_12, index: ivtmp.51_61, offset: 0B], tmp127
    add     rax, 16   # ivtmp.51,
    cmp     rax, 100000       # ivtmp.51,
    jne     .L31      #,

Since you want to write a separate destination, there's no way to get the loop down to 4 fused-domain uops so it can run at one vector per clock without unrolling. (Both the load and the store need to be one-register addressing modes, so the trick of using src - dst indexed by current_dst instead of incrementing src doesn't work).

Modifying your C++ so gcc uses a pointer increment would only save one uop, because you have to increment both src and dst. That is, float *endp = start + length; with for (p = start ; p < endp ; p+=4) {} would make a loop like

.loop:
    rsqrtps  xmm0, [rsi]
    add      rsi, 16
    movaps   [rdi], xmm0
    add      rdi, 16
    cmp      rdi, rbx
    jne      .loop

Hopefully gcc will do something like this while unrolling; otherwise rsqrtps + movaps will be 4 fused-domain uops on their own if they still use an indexed addressing mode, and no amount of unrolling will make your loop run at one vector per clock.
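For reference, a sketch of the pointer-increment form described above (illustrative names); whether gcc keeps this shape, and how far it unrolls it, depends on the version and flags:

#include <xmmintrin.h>
#include <cstddef>

// Separate src/dst pointer increments, end-pointer compare.
void rsqrt_ptr_inc(const float* src, float* dst, size_t len){
    const float* end = src + len;
    for (; src < end; src += 4, dst += 4){
        _mm_store_ps(dst, _mm_rsqrt_ps(_mm_load_ps(src)));
    }
}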



Answer 2:

Because this is a streaming calculation with very low arithmetic intensity, you are almost certainly memory bound. You will likely find a measurable speedup if you use non-temporal loads and stores.

// Hint to the CPU that we don't want to use the cache
// Be sure to update this if you use manual loop unrolling
_mm_prefetch(reinterpret_cast<char*>(ar_in_now+4), _MM_HINT_NTA);

// Hint to the CPU that we don't need to write back through the cache
_mm_stream_ps(ar_out_now,xmm0);
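For context, a sketch of how those hints could fit into the loop from the question (the prefetch distance of 64 floats ahead is a guess that would need tuning; prefetches past the end of the array are harmless):

#include <xmmintrin.h>  // _mm_prefetch, _mm_rsqrt_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Sketch: NT-prefetch the input, compute rsqrt, stream the result past the cache.
void rsqrt_stream(const float* in, float* out, size_t n){
    for (size_t i = 0; i < n; i += 4){
        _mm_prefetch(reinterpret_cast<const char*>(in + i + 64), _MM_HINT_NTA);
        _mm_stream_ps(out + i, _mm_rsqrt_ps(_mm_load_ps(in + i)));
    }
    _mm_sfence();  // order the non-temporal stores before anything that follows
}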

EDIT:

I ran some tests to see what things look like on different hardware. Unsurprisingly, the results are quite sensitive to the hardware in use. I would posit that it is likely due to the increased number of read/write buffers in modern CPUs.

All code was compiled using gcc-6.1 with

gcc -std=c++14 -O3 -march=native -mavx -mfpmath=sse -ffast-math

Intel Core i3-3120M @ 2.5GHz; 3MB cache

OP's code:               350 +- 10 milliseconds
NTA Prefetch:            360 +- 5 milliseconds
NTA Prefetch+NTA store:  430 +- 10 milliseconds

Intel Core i7-6500U CPU @ 2.50GHz; 4MB cache

OP's code:               205 +- 5 milliseconds
NTA Prefetch:            220 +- 2 milliseconds
NTA Prefetch+NTA store:  200 +- 5 milliseconds

When the problem size is increased to 2 MB, NTA Prefetch+NTA store shows a ~30% decrease in runtime compared to the OP's solution.

Results: The problem size is too small to substantially benefit from NTA. On older architectures, it is detrimental. On newer architectures, it is on par with the OP's solution.

Conclusion: Probably not worth the extra effort in this case.



Answer 3:

You have to measure it, of course, but there is well-known code to compute a (not very precise) inverse square root; see https://www.beyond3d.com/content/articles/8/

float InvSqrt (float x) {
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

A conversion to SSE2, tested with VS2015 and GCC 5.4.0:

__m128 InvSqrtSSE2(__m128 x) {
    __m128 xhalf = _mm_mul_ps(x, _mm_set1_ps(0.5f));

    x = _mm_castsi128_ps(_mm_sub_epi32(_mm_set1_epi32(0x5f3759df), _mm_srai_epi32(_mm_castps_si128(x), 1)));

    return _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(xhalf, _mm_mul_ps(x, x))));
}
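As a usage sketch, it can replace _mm_rsqrt_ps in the loop from the question (the wrapper function here is just an illustration); note that where rsqrtps is available, the hardware approximation plus a Newton step is usually both faster and more accurate:

#include <emmintrin.h>
#include <cstddef>

// Sketch: apply InvSqrtSSE2 over a 16-byte aligned array, n a multiple of 4.
void invsqrt_array(const float* in, float* out, size_t n){
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(out + i, InvSqrtSSE2(_mm_load_ps(in + i)));
}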

UPDATE I

Mea culpa! Thanks to @EOF, who noticed the improper conversions; they have been replaced with casts.