Floating-point calculation is neither associative nor distributive on processors. So `(a + b) + c` is not equal to `a + (b + c)`, and `a * (b + c)` is not equal to `a * b + a * c`.
Is there any way to perform floating-point calculations deterministically, so that they don't give different results from run to run? They would of course be deterministic on a uniprocessor, but they are not deterministic in multithreaded programs if, for example, several threads add to a shared sum, because the threads may interleave differently on each run.
So my question is: how can one achieve deterministic results for floating-point calculations in multithreaded programs?
Edit: I've removed my old answer since I seem to have misunderstood OP's question. If you want to see it you can read the edit history.
I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.
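A minimal sketch of the per-thread-accumulator idea, using POSIX threads (the `data` array, `NTHREADS`, and the even partitioning are made up for illustration; compile with `-pthread`):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];            /* values to be summed (filled elsewhere) */
static double partial[NTHREADS];  /* one accumulator per thread, no locking */

static void *sum_range(void *arg) {
    int t = *(int *)arg;
    size_t begin = (size_t)t * N / NTHREADS;
    size_t end   = (size_t)(t + 1) * N / NTHREADS;
    double acc = 0.0;
    for (size_t i = begin; i < end; i++)
        acc += data[i];           /* purely thread-local work */
    partial[t] = acc;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, sum_range, &ids[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    /* Combine the per-thread accumulators in a fixed order, so the
       final rounding is identical on every run. */
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        total += partial[t];

    printf("total = %f\n", total);
    return 0;
}
```

Because each `partial[t]` is written by exactly one thread and the final combine runs in a fixed order on a single thread, the result is the same on every run for the same input and thread count.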
Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating-point arithmetic. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example, if you know the accumulator will never exceed 2^32, you can start the accumulator at `0x1p32`. This will lock you at 32 bits of precision to the left of the radix point and 20 bits of fractional precision (assuming `double`). If that's not enough precision, you could use a smaller bias (assuming the accumulator will not grow too large) or switch to `long double`. If `long double` is the 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision.

Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.
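A rough sketch of the bias idea, under the same assumption as above that the true sum never exceeds 2^32:

```c
#include <stdio.h>

int main(void) {
    const double bias = 0x1p32;   /* locks the exponent, assuming the sum stays below 2^32 */
    double acc = bias;

    /* Every addition now rounds at the same fixed position (2^-20 for double),
       regardless of the order in which the terms arrive. */
    acc += 1.25;
    acc += 3.75;
    acc += 0.0000001;             /* below the 2^-20 resolution, so it is rounded away */

    double sum = acc - bias;      /* to "use" the value, subtract the bias back out */
    printf("sum = %f\n", sum);    /* prints 5.000000 */
    return 0;
}
```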
Use packed decimal.
Use a decimal type or library supporting such a type.
Try storing each intermediate result in a volatile object:
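For example, a minimal sketch along these lines (the function and variable names are just placeholders):

```c
double sum3(double a, double b, double c) {
    volatile double tmp = a + b;  /* force the intermediate to be stored and rounded to double */
    return tmp + c;               /* then add the stored value to c */
}
```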
This is likely to have nasty effects on performance. I suggest measuring both versions.
EDIT: The purpose of `volatile` is to inhibit optimizations that might affect the results even in a single-threaded environment, such as changing the order of operations or storing intermediate results in wider registers. It doesn't address multi-threading issues.

EDIT2: Something else to consider is that a C implementation may contract floating-point expressions, evaluating several operations as if they were a single operation (for example, fusing a multiply and an add) and omitting the intermediate rounding. This can be inhibited by using `#pragma STDC FP_CONTRACT OFF`.
Reference: C99 standard (large PDF), sections 7.12.2 and 6.5 paragraph 8. This is C99-specific; some compilers might not support it.
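As a small sketch of the pragma (support varies; some compilers ignore it or control contraction only through command-line options):

```c
#include <math.h>

#pragma STDC FP_CONTRACT OFF      /* forbid contracting a*x + y into a single fused operation */

double axpy(double a, double x, double y) {
    return a * x + y;             /* the product and the sum must now be rounded separately */
}
```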
Even using a high-precision fixed-point data type would not solve the problem of making the results for said equations deterministic (except in certain cases). As Keith Thompson pointed out in a comment, 1/3 is a trivial counter-example of a value that cannot be stored correctly in either a standard base-10 or base-2 floating-point representation (regardless of the precision or memory used).
One solution that, depending upon particular needs, may address this issue (it still has limits) is to use a rational-number data type (one that stores both a numerator and a denominator). Keith suggested GMP as one such library.
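For instance, a small sketch with GMP's rational type `mpq_t` (link with `-lgmp`); here 1/3, which no binary or decimal floating-point format can hold exactly, is represented exactly:

```c
#include <stdio.h>
#include <gmp.h>

int main(void) {
    mpq_t third, sum;
    mpq_init(third);
    mpq_init(sum);                   /* initialized to 0/1 */
    mpq_set_ui(third, 1, 3);         /* exactly 1/3 */

    for (int i = 0; i < 3; i++)
        mpq_add(sum, sum, third);    /* exact rational addition, no rounding at all */

    gmp_printf("sum = %Qd\n", sum);  /* prints "sum = 1" */

    mpq_clear(third);
    mpq_clear(sum);
    return 0;
}
```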
Whether it is suitable (or adequate) for this task is another story...
Happy coding.
Floating-point is deterministic. The same floating-point operations, run on the same hardware, always produce the same result. There is no black magic, noise, randomness, fuzzing, or any of the other things that people commonly attribute to floating-point. The tooth fairy does not show up, take the low bits of your result, and leave a quarter under your pillow.
Now, that said, certain blocked algorithms that are commonly used for large-scale parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
What can you do about it?
First, make sure that you actually can't live with the situation. Many of the things you might do to enforce ordering in a parallel computation will hurt performance. That's just how it is.
I would also note that although blocked algorithms may introduce some amount of non-determinism, they frequently deliver results with smaller rounding errors than do naive unblocked serial algorithms (surprising but true!). If you can live with the errors produced by a naive serial algorithm, you can probably live with the errors of a parallel blocked algorithm.
Now, if you really, truly, need exact reproducibility across runs, here are a few suggestions that tend not to adversely affect performance too much:
Don't use multithreaded algorithms that can reorder floating-point computations. Problem solved. This doesn't mean you can't use multithreaded algorithms at all, merely that you need to ensure that each individual result is only touched by a single thread between synchronization points. Note that this can actually improve performance on some architectures if done properly, by reducing data-cache (D$) contention between cores.
In reduction operations, you can have each thread store its result to an indexed location in an array, wait for all threads to finish, then accumulate the elements of the array in order. This adds a small amount of memory overhead, but is generally pretty tolerable, especially when the number of threads is "small".
Find ways to hoist the parallelism. Instead of computing 24 matrix multiplications, each one of which uses parallel algorithms, compute 24 matrix products in parallel, each one of which uses a serial algorithm. This, too, can be beneficial for performance (sometimes enormously so).
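A rough sketch of that idea with POSIX threads (the `serial_kernel` function and `NTASKS` are invented stand-ins for, say, a serial matrix product): the parallelism lives between independent tasks, while each task's floating-point work runs in a fixed serial order.

```c
#include <pthread.h>
#include <stdio.h>

#define NTASKS 24

static double result[NTASKS];

/* Stand-in for a deterministic serial computation (e.g. one matrix product). */
static double serial_kernel(int k) {
    double acc = 0.0;
    for (int i = 1; i <= 1000; i++)
        acc += (double)k / i;        /* fixed, serial order of operations */
    return acc;
}

static void *run_task(void *arg) {
    int k = *(int *)arg;
    result[k] = serial_kernel(k);    /* each result is touched by exactly one thread */
    return NULL;
}

int main(void) {
    pthread_t tid[NTASKS];
    int ids[NTASKS];

    for (int k = 0; k < NTASKS; k++) {
        ids[k] = k;
        pthread_create(&tid[k], NULL, run_task, &ids[k]);
    }
    for (int k = 0; k < NTASKS; k++)
        pthread_join(tid[k], NULL);

    printf("result[7] = %f\n", result[7]);
    return 0;
}
```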
There are lots of other ways to handle this. They all require thought and care. Parallel programming usually does.