I have two functions that multiply 2D arrays. One of them uses SSE; the other has no optimization. Both functions work, but the results are slightly different, e.g. 20.333334 and 20.333332.
Can you explain why the results differ, and what I can do to the functions so they give the same result?
Function with SSE:
float** sse_multiplication(float** array1, float** array2, float** arraycheck)
{
    int i, j, k;
    float *ms1, *ms2, result;
    float *end_loop;

    for (i = 0; i < rows1; i++)
    {
        for (j = 0; j < columns2; j++)
        {
            result = 0;
            ms1 = array1[i];
            ms2 = array2[j];
            end_loop = &array1[i][columns1];

            __asm {
                    mov rax, ms1
                    mov rbx, ms2
                    mov rdx, end_loop
                    xorps xmm2, xmm2
                loop:
                    movups xmm0, [rax]
                    movups xmm1, [rbx]
                    movups xmm3, [rax+16]
                    movups xmm4, [rbx+16]
                    mulps xmm0, xmm1
                    mulps xmm3, xmm4
                    addps xmm2, xmm0
                    add rax, 32
                    add rbx, 32
                    cmp rdx, rax
                    jne loop
                    haddps xmm2, xmm2
                    haddps xmm2, xmm2
                    movups result, xmm2
            }
            arraycheck[i][j] = result;
        }
    }
    return arraycheck;
}
Function without any optimization:
float** multiplication(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int j = 0; j < columns2; j++)
            for (int k = 0; k < rows1; k++)
                arraycheck[i][j] += array1[i][k] * array2[k][j];

    return arraycheck;
}
FP addition is not perfectly associative, so a different order of operations produces slightly different rounding errors.
Your C sums elements in order. (Unless you use -ffast-math to allow the compiler to make the same assumption you did, that FP operations are close enough to associative.)
Your asm sums up every 4th element at 4 different offsets, then horizontally sums those. The sum in each vector element is rounded differently at each point.
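For concreteness, here's a tiny standalone demo of the non-associativity, with values I picked to make the effect dramatic:

#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = -1e8f, c = 0.333334f;
    printf("%.6f\n", (a + b) + c);  // 0.333334: a+b is exactly zero
    printf("%.6f\n", a + (b + c));  // 0.000000: c vanishes when rounded into b
    return 0;
}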
Your vectorized version doesn't seem to match the C version; the indexing looks different. AFAICT, the only sane way to vectorize arraycheck[i][j] += array1[i][k] * array2[k][j]; is over j. Looping over k would require strided loads from array2, and looping over i would require strided loads from array1.
Am I missing something about your asm? It's loading contiguous values from both arrays. It's also throwing away the mulps result in xmm3 every iteration of loop, so I think it's just buggy.
Since looping over j in the inner vector loop doesn't change array1[i][k], just broadcast-load it once outside the loop (_mm256_set1_ps). However, that means doing a read-modify-write of arraycheck[i][j] for every different j value, i.e. ac[i][j + 0..3] = fma(a1[i][k], a2[k][j + 0..3], ac[i][j + 0..3]). To avoid this, you'd have to transpose one of the arrays first. (But that's O(N^2) for an NxN matrix, which is still cheaper than the multiply.)
This way doesn't use horizontal sums, but see that link if you want better code for that. It does operations in the same order as the scalar C, so results should match exactly.
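Something like this (a minimal sketch, not drop-in code: I'm assuming columns2 is a multiple of 4 and that arraycheck starts zeroed, and the function name is just for illustration; it's 128-bit SSE to match the question, with separate multiply and add so each element sees the same operation order as the scalar C):

#include <xmmintrin.h>

void sse_mult_over_j(float **array1, float **array2, float **arraycheck,
                     int rows1, int columns1, int columns2)
{
    for (int i = 0; i < rows1; i++) {
        for (int k = 0; k < columns1; k++) {
            __m128 a = _mm_set1_ps(array1[i][k]);      // broadcast once; invariant over j
            for (int j = 0; j < columns2; j += 4) {
                __m128 b  = _mm_loadu_ps(&array2[k][j]);
                __m128 ac = _mm_loadu_ps(&arraycheck[i][j]);
                ac = _mm_add_ps(ac, _mm_mul_ps(a, b)); // same mul-then-add per element as the scalar C
                _mm_storeu_ps(&arraycheck[i][j], ac);
            }
        }
    }
}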
Also note that you need to use more than one accumulator to saturate the execution units of a CPU. I'd suggest 8, to saturate Skylake's 4c-latency, one-per-0.5c-throughput addps. Haswell has 3c latency, one per 1c addps, but Skylake dropped the separate FP add unit and does FP adds in the FMA unit. (See the x86 tag wiki, esp. Agner Fog's guides.)
Actually, since my suggested change doesn't use a single accumulator at all, every loop iteration accesses independent memory. You'll need a bit of loop unrolling to saturate the FP execution units with two loads and a store in the loop (even though you only need two pointers, since the store goes back to the same location as one of the loads). But anyway, if your data fits in L1 cache, out-of-order execution should keep the execution units pretty well supplied with work from separate iterations.
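For the reduction case (an actual dot product, where you do keep running sums in registers), here's a rough sketch of the multiple-accumulator idea; the function name and the assumption that n is a multiple of 32 are mine:

#include <xmmintrin.h>

float dot8(const float *x, const float *y, int n)
{
    __m128 acc[8];
    for (int a = 0; a < 8; a++)
        acc[a] = _mm_setzero_ps();
    for (int i = 0; i < n; i += 32) {
        for (int a = 0; a < 8; a++) {
            __m128 p = _mm_mul_ps(_mm_loadu_ps(x + i + 4*a),
                                  _mm_loadu_ps(y + i + 4*a));
            acc[a] = _mm_add_ps(acc[a], p);  // 8 independent dependency chains hide addps latency
        }
    }
    for (int a = 1; a < 8; a++)              // combine the accumulators at the end
        acc[0] = _mm_add_ps(acc[0], acc[a]);
    float tmp[4];
    _mm_storeu_ps(tmp, acc[0]);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}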
If you really care about performance, you'll make an FMA version, and maybe an AVX-without-FMA version for Sandybridge. You could be doing two 256b FMAs per clock instead of one 128b add and mul per clock. (And of course you're not even getting that, because you bottleneck on latency unless the loop is short enough for the out-of-order window to see independent instructions from the next iteration).
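A sketch of what the FMA version of the same inner loop might look like (assuming AVX2+FMA hardware and columns2 a multiple of 8; note that fmadd rounds once per multiply-add, so its results will differ slightly from the scalar version):

#include <immintrin.h>

void fma_mult_over_j(float **array1, float **array2, float **arraycheck,
                     int rows1, int columns1, int columns2)
{
    for (int i = 0; i < rows1; i++) {
        for (int k = 0; k < columns1; k++) {
            __m256 a = _mm256_set1_ps(array1[i][k]);
            for (int j = 0; j < columns2; j += 8) {
                __m256 b  = _mm256_loadu_ps(&array2[k][j]);
                __m256 ac = _mm256_loadu_ps(&arraycheck[i][j]);
                ac = _mm256_fmadd_ps(a, b, ac);   // ac += a*b with a single rounding
                _mm256_storeu_ps(&arraycheck[i][j], ac);
            }
        }
    }
}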
You're going to need "loop tiling", aka "cache blocking" to make this not suck for big problem sizes. This is a matrix multiply, right? There are very good libraries for this, which are tuned for cache sizes, and will beat the pants off a simple attempt like this. e.g. ATLAS was good last time I checked, but that was several years ago.
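To show the shape of loop tiling in this kind of code, here's a rough scalar sketch (TILE is a tuning parameter I made up; real values depend on your cache sizes, and the bounds assume all dimensions are multiples of TILE):

#define TILE 64

void tiled_mult(float **array1, float **array2, float **arraycheck,
                int rows1, int columns1, int columns2)
{
    for (int ii = 0; ii < rows1; ii += TILE)
        for (int kk = 0; kk < columns1; kk += TILE)
            for (int jj = 0; jj < columns2; jj += TILE)
                // work on one TILE x TILE block while it's hot in cache
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = array1[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            arraycheck[i][j] += a * array2[k][j];
                    }
}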
Use intrinsics, unless you write the whole function in asm. Compilers "understand" what intrinsics do, so they can make nice optimizations, like loop unrolling when appropriate.
According to the IEEE 754 formats, a 32-bit float can only guarantee 6-7 significant digits of accuracy. Your error is so marginal that it falls within normal float rounding; it says nothing about the compiler's behavior. If you want better precision, it would be wise to either choose a 64-bit double (which guarantees 15 significant digits) or implement your own BigDecimal class like Java does.
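A quick check that the question's discrepancy (20.333334 vs. 20.333332) sits right at the edge of float precision, differing only in the 8th significant digit:

#include <stdio.h>

int main(void)
{
    float  f = 61.0f / 3.0f;              // exact value is 20.3333...
    double d = 61.0  / 3.0;
    printf("float : %.6f\n", (double)f);  // 20.333334
    printf("double: %.15f\n", d);         // 20.333333333333332
    return 0;
}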