Here's the code I'm trying to convert to SSE2:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
// some other code (that will use phase)
phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI);
while (phase >= TWOPI) { phase -= TWOPI; }
}
Here's what I've achieved:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;
__m128d v_boundLower = _mm_set1_pd(0.0);
__m128d v_boundUpper = _mm_set1_pd(PI);
__m128d v_bp0 = _mm_set1_pd(bp0);
__m128d v_radiansPerSample = _mm_set1_pd(mRadiansPerSample);
__m128d v_phase = _mm_set1_pd(phase);
__m128d v_pB = _mm_load_pd(pB);
__m128d v_pC = _mm_load_pd(pC);
__m128d v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);
for (int sampleIndex = 0; sampleIndex < roundintup8(blockSize); sampleIndex += 8, pB += 8, pC += 8) {
// some other code (that will use v_phase)
v_phase = _mm_add_pd(v_phase, v_result);
v_pB = _mm_load_pd(pB + 2);
v_pC = _mm_load_pd(pC + 2);
v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);
v_phase = _mm_add_pd(v_phase, v_result);
v_pB = _mm_load_pd(pB + 4);
v_pC = _mm_load_pd(pC + 4);
v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);
v_phase = _mm_add_pd(v_phase, v_result);
v_pB = _mm_load_pd(pB + 6);
v_pC = _mm_load_pd(pC + 6);
v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);
v_phase = _mm_add_pd(v_phase, v_result);
v_pB = _mm_load_pd(pB + 8);
v_pC = _mm_load_pd(pC + 8);
v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);
// ... fmod?
}
But I'm not really sure how to replace while (phase >= TWOPI) { phase -= TWOPI; }
(which is basically a classic fmod
in C++).
Any fancy intrinsics? Can't find any on this list. Division + some sort of rocket bit-shifting?
As comments are saying, it looks like in this you can make it just a masked subtract with a compare +
andpd
. This works as long as you can never be more than one subtract away from getting back into the desired range.Like
To implement an actual (slow)
fmod
without worrying too much about the last few bits of the significand, you'd dointeger_quotient = floor(x/y)
(or mayberint(x/y)
orceil
), thenx - y * integer_quotient
.floor
/rint
/ceil
are cheap with SSE4.1_mm_round_pd
or_mm_floor_pd()
. That will give you the remainder, which can be negative just like with integer division.I'm sure there are numerical techniques that better avoid rounding error before the catastrophic cancellation from subtracting two nearby numbers. If you care about precision, go check. (Using
double
vectors when you don't care much about precision is kind of silly; might as well usefloat
and get twice as much work done per vector). If the input is a lot larger than the modulus, there's an unavoidable loss of precision, and minimizing rounding error in the temporary is probably very important. But otherwise precision will only be a problem unless you care about relative error in results very close to zero whenx
is almost an exact multiple ofy
. (Near-zero result, the only the bottom few bits of the significand are left for precision.)Without SSE4.1, there are tricks like adding then subtracting a large enough number. Converting to integer and back is even worse for
pd
, because the packed-conversion instruction decodes to some shuffle uops as well. Not to mention that a 32-bit integer doesn't cover the full range ofdouble
, but you're screwed for range-reduction precision if your input was that huge.If you have FMA, you can avoid rounding error in the
y * integer_quotient
part of the multiply and sub._mm_fmsub_pd
.