Using % with SSE2?

Here's the code I'm trying to convert to SSE2:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    // some other code (that will use phase)

    phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI);

    while (phase >= TWOPI) { phase -= TWOPI; }
}

Here's what I've achieved:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

__m128d v_boundLower = _mm_set1_pd(0.0);
__m128d v_boundUpper = _mm_set1_pd(PI);
__m128d v_bp0 = _mm_set1_pd(bp0);
__m128d v_radiansPerSample = _mm_set1_pd(mRadiansPerSample);

__m128d v_phase = _mm_set1_pd(phase);
__m128d v_pB = _mm_load_pd(pB);
__m128d v_pC = _mm_load_pd(pC);
__m128d v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);

for (int sampleIndex = 0; sampleIndex < roundintup8(blockSize); sampleIndex += 8, pB += 8, pC += 8) {
    // some other code (that will use v_phase)

    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 2);
    v_pC = _mm_load_pd(pC + 2);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 4);
    v_pC = _mm_load_pd(pC + 4);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 6);
    v_pC = _mm_load_pd(pC + 6);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 8);
    v_pC = _mm_load_pd(pC + 8);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);

    // ... fmod?
}

But I'm not really sure how to replace while (phase >= TWOPI) { phase -= TWOPI; } (which is basically a classic fmod in C++).

Any fancy intrinsics? Can't find any on this list. Division + some sort of rocket bit-shifting?

As comments are saying, it looks like in this you can make it just a masked subtract with a compare + andpd. This works as long as you can never be more than one subtract away from getting back into the desired range.

const __m128d v2pi = _mm_set1_pd(TWOPI);


__m128d needs_range_reduction = _mm_cmpge_pd(vphase, v2pi);
__m128d offset = _mm_and_pd(needs_range_reduction, v2pi);  // 0.0 or 2*Pi
vphase = _mm_sub_pd(vphase, offset);

To implement an actual (slow) fmod without worrying too much about the last few bits of the significand, you'd do integer_quotient = floor(x/y) (or maybe rint(x/y) or ceil), then x - y * integer_quotient. floor / rint / ceil are cheap with SSE4.1 _mm_round_pd or _mm_floor_pd(). That will give you the remainder, which can be negative just like with integer division.

I'm sure there are numerical techniques that better avoid rounding error before the catastrophic cancellation from subtracting two nearby numbers. If you care about precision, go check. (Using double vectors when you don't care much about precision is kind of silly; might as well use float and get twice as much work done per vector). If the input is a lot larger than the modulus, there's an unavoidable loss of precision, and minimizing rounding error in the temporary is probably very important. But otherwise precision will only be a problem unless you care about relative error in results very close to zero when x is almost an exact multiple of y. (Near-zero result, the only the bottom few bits of the significand are left for precision.)

Without SSE4.1, there are tricks like adding then subtracting a large enough number. Converting to integer and back is even worse for pd, because the packed-conversion instruction decodes to some shuffle uops as well. Not to mention that a 32-bit integer doesn't cover the full range of double, but you're screwed for range-reduction precision if your input was that huge.

If you have FMA, you can avoid rounding error in the y * integer_quotient part of the multiply and sub. _mm_fmsub_pd.