Modulo 2*Pi using SSE/SSE2

I'm still pretty new to using SSE and am trying to implement a modulo of 2*Pi for double-precision inputs of the order 1e8 (the result of which will be fed into some vectorised trig calculations).

My current attempt at the code is based around the idea that mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi and looks like:

#define _PD_CONST(Name, Val)                                            \
static const double _pd_##Name[2] __attribute__((aligned(16))) = { Val, Val }  

_PD_CONST(2Pi, 6.283185307179586);  /* = 2*pi  */  
_PD_CONST(recip_2Pi, 0.159154943091895); /* = 1/(2*pi)  */

void vec_mod_2pi(const double * vec, int Size, double * modAns)
{
    __m128d sse_a, sse_b, sse_c;
    int i;
    int k = 0;
    double t = 0;

    unsigned int initial_mode;
    initial_mode = _MM_GET_ROUNDING_MODE();

    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);

    for (i = 0; i < Size; i += 2)
    {
        sse_a = _mm_loadu_pd(vec+i);
        sse_b = _mm_mul_pd( _mm_cvtepi32_pd( _mm_cvtpd_epi32( _mm_mul_pd(sse_a, *(__m128d*)_pd_recip_2Pi) ) ), *(__m128d*)_pd_2Pi);
        sse_c = _mm_sub_pd(sse_a, sse_b);
        _mm_storeu_pd(modAns+i,sse_c);
    }

    k = i-2;
    for (i = 0; i < Size%2; i++)
    {
        t = (double)((int)(vec[k+i] * 0.159154943091895)) * 6.283185307179586;
        modAns[k+i] = vec[k+i] - t;
    }

    _MM_SET_ROUNDING_MODE(initial_mode);
}

Unfortunately, this is currently returning a lot of NaN with a couple of answers of 1.128e119 as well (some what outside the range of 0 -> 2*Pi that I was aiming for!). I suspect that where I'm going wrong is in the double-to-int-to-double conversion that I'm trying to use to do the floor.

Can anyone suggest where I've gone wrong and how to improve it?

P.S. sorry about the format of that code, it's the first time I've posted a question on here and can't seem to get it to give me empty lines within the code block to make it readable.

回答1:

If you want any kind of accuracy, the simple algorithm is terribly bad. For an accurate range reduction algorithm, see e.g. Ng et al., ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (now available via the Wayback Machine: 2012-12-24)

回答2:

For large arguments Hayne-Panek algorithm is typically used. However, the Hayne-Panek paper is quite difficult to read, and I suggest to have a look at Chapter 11 in the Handbook of Floating-Point Arithmetic for a more accessible explanation.