Here's the sample C code that I am trying to accelerate using SSE, the two arrays are 3072 element long with doubles, may drop it down to float if i don't need the precision of doubles.
double sum = 0.0;
for(k = 0; k < 3072; k++) {
sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway my current problem is how to do the fabs step in a SSE register for doubles or float so that I can keep the whole calculation in the SSE registers so that it remains fast and I can parallelize all of the steps by partly unrolling this loop.
Here's some resources I've found fabs() asm or possibly this flipping the sign - SO however the weakness of the second one would need a conditional check.
I suggest using bitwise and with a mask. Positive and negative values have the same representation, only the most significant bit differs, it is 0 for positive values and 1 for negative values, see double precision number format. You can use one of these:
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions our of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, the CPU can actually perform it in the same cycle as addition/subtraction. I suggest Agner Fog's optimization manuals.
Finally, I don't recommend using openMP. The loop is too small and the overhead of distribution the job among multiple threads might be bigger than any potential benefit.
Probably the easiest way is as follows:
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in
vsum
into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.The maximum of -x and x should be abs(x). Here it is in code: