I was trying to optimize following code (sum of squared differences for two arrays):
inline float Square(float value)
{
return value*value;
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
So I performed optimization with using of SSE instructions of CPU:
inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}
inline float ExtractSum(__m128 a)
{
float _a[4];
_mm_storeu_ps(_a, a);
return _a[0] + _a[1] + _a[2] + _a[3];
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceSum(a, b, i, sums);
float sum = ExtractSum(sums);
for(; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
This code works fine if the size of the arrays is not too large. But if the size is big enough then there is a large computing error between results given by base function and its optimized version. And so I have a question: Where is here a bug in SSE optimized code, which leads to the computing error.
The error follows from finite precision floating point numbers. Each addition of two floating point numbers is has an computing error proportional to difference between them. In your scalar version of algorithm the resulting sum is much greater then each term (if size of arrays is big enough of course). So it leads to accumulation of big computing error.
In the SSE version of algorithm actually there is four sums for results accumulation. And difference between these sums and each term is lesser in four times relative to scalar code. So this leads to the lesser computing error.
There are two ways to solve this error:
1) Using of floating point numbers of double precision for accumulating sum.
2) Using of the the Kahan summation algorithm (also known as compensated summation) which significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
With using of Kahan summation algorithm your scalar code will look like:
And SSE optimized code will look as follow: