Can anybody explain the difference in the calculation results between single-threaded and multi-threaded mode?
Here is an example of an approximate calculation of pi:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <ppl.h>

const int itera(1000000000);

int main()
{
    printf("PI calculation \nconst int itera = 1000000000\n\n");

    clock_t start, stop;

    // Single thread
    start = clock();
    double summ_single(0);
    for (int n = 1; n < itera; n++)
    {
        summ_single += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    }
    stop = clock();
    printf("Time single thread %f\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Multithreaded with OpenMP
    // (activate OpenMP in Project settings -> C/C++ -> Language)
    start = clock();
    double summ_omp(0);
    #pragma omp parallel for reduction(+:summ_omp)
    for (int n = 1; n < itera; n++)
    {
        summ_omp += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    }
    stop = clock();
    printf("Time OMP parallel %f\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Multithreaded with Concurrency::parallel_for
    start = clock();
    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera, [&piParts](int n)
    {
        piParts.local() += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    });

    double summ_Conparall(0);
    piParts.combine_each([&summ_Conparall](double locali)
    {
        summ_Conparall += locali;
    });
    stop = clock();
    printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / CLOCKS_PER_SEC);

    printf("\n");
    printf("pi single = %15.12f\n", std::sqrt(summ_single));
    printf("pi omp = %15.12f\n", std::sqrt(summ_omp));
    printf("pi comb = %15.12f\n", std::sqrt(summ_Conparall));
    printf("\n");

    system("PAUSE");
}
And the results:
PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 11.103000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497
PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency::parallel_for 7.413000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841
PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319
PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489
The tests were made on AMD and Intel CPUs, Win 7 x64.
What is the reason for the difference between the PI calculations in single-threaded and multi-threaded mode? Why is the result of the calculation with Concurrency::parallel_for not constant across different builds (compiler, 32/64-bit platform)?
P.S. Visual Studio Express doesn't support OpenMP.
So... in Win32 mode the x87 FPU with its 80-bit registers is used, while in x64 mode SSE2 with double-precision (64-bit) floats is used; SSE2 seems to be the default in x64 mode.
Theoretically... is it possible that the calculation in Win32 mode is more precise? :) http://en.wikipedia.org/wiki/SSE2 So is the better way to buy new CPUs with AVX, or to compile to 32-bit code?...
Floating-point addition is a non-associative operation due to round-off errors, therefore the order of operations matters. Having your parallel program give different results than the serial version is normal; understanding and dealing with it is part of the art of writing (portable) parallel code. This is exacerbated in the 32-bit versus 64-bit builds, since in 32-bit mode the VS compiler uses x87 instructions and the x87 FPU does all operations with an internal precision of 80 bits, while in 64-bit mode SSE math is used.
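To make the non-associativity concrete, here is a minimal standalone snippet (my own illustration, not part of the original code) in which merely regrouping the additions changes the result:

#include <cstdio>

int main()
{
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    // (a + b) + c: a and b cancel first, so the small term survives.
    double left = (a + b) + c;    // 1.0

    // a + (b + c): c is absorbed into b (1.0 is below the rounding
    // granularity near 1.0e16), so it is lost before the cancellation.
    double right = a + (b + c);   // 0.0

    printf("(a + b) + c = %f\n", left);
    printf("a + (b + c) = %f\n", right);
}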
In the serial case, one thread computes s1+s2+...+sN, where N is the number of terms in the expansion.
In the OpenMP case there are n partial sums, where n is the number of OpenMP threads. Which terms get into each partial sum depends on the way iterations are distributed among the threads. The default for many OpenMP implementations is static scheduling, which means that thread 0 (the main thread) computes ps0 = s1 + s2 + ... + sN/n; thread 1 computes ps1 = sN/n+1 + sN/n+2 + ... + s2N/n; and so on. In the end the reduction combines those partial sums in some (implementation-defined) order.
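As a rough sketch (my own illustration, assuming the static schedule splits the range into equal contiguous chunks), the OpenMP reduction behaves roughly like one contiguous partial sum per thread followed by a short serial combination:

#include <cstdio>
#include <cmath>
#include <vector>
#include <omp.h>

int main()
{
    const int itera = 1000000000;

    // One partial sum per thread, each covering a contiguous block of n,
    // mimicking what static scheduling does by default.
    std::vector<double> partial(omp_get_max_threads(), 0.0);

    #pragma omp parallel
    {
        const int nth = omp_get_num_threads();   // actual team size
        const int tid = omp_get_thread_num();
        const int chunk = (itera - 1) / nth + 1;
        const int begin = 1 + tid * chunk;
        const int end = (begin + chunk < itera) ? begin + chunk : itera;

        double local = 0.0;
        for (int n = begin; n < end; n++)
            local += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
        partial[tid] = local;
    }

    // The final combination is a short serial sum over the partial results;
    // the order of these few additions is what the reduction step decides.
    double summ = 0.0;
    for (size_t t = 0; t < partial.size(); t++)
        summ += partial[t];

    printf("pi approx = %15.12f\n", std::sqrt(summ));
}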
The parallel_for case is very similar to the OpenMP one. The difference is that by default the iterations are distributed in a dynamic fashion - see the documentation for auto_partitioner - therefore each partial sum contains a more or less random selection of terms. This not only gives a slightly different result than the serial code, it also gives a slightly different result with each execution, i.e. the results from two consecutive parallel_for's with the same number of threads might differ a bit. If you replace the partitioner with an instance of simple_partitioner and set the chunk size equal to itera / number-of-threads, you should get the same result as in the OpenMP case, provided the reduction is performed the same way.
You could also use Kahan summation and implement your own reduction, also using Kahan summation. Then the parallel codes should produce the same (or at least a much more similar) result as the serial one.
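As a sketch of the Kahan idea (my own illustration, serial only; the same loop body can be reused inside each thread's partial sum and again when combining the partial sums), a compensated accumulator looks like this:

#include <cstdio>
#include <cmath>

int main()
{
    const int itera = 1000000000;

    // Kahan (compensated) summation: "c" carries the low-order bits that
    // are lost when a small term is added to an already large sum.
    double sum = 0.0;
    double c = 0.0;

    for (int n = 1; n < itera; n++)
    {
        double value = 6.0 / (static_cast<double>(n) * static_cast<double>(n));
        double y = value - c;   // apply the correction from the previous step
        double t = sum + y;     // big + small: low-order bits of y are lost here
        c = (t - sum) - y;      // recover what was just lost (algebraically zero)
        sum = t;
    }

    // Note: aggressive FP optimization (e.g. /fp:fast) may optimize the
    // compensation away, so compile with /fp:precise or /fp:strict.
    printf("pi (Kahan, serial) = %15.12f\n", std::sqrt(sum));
}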
I would guess that the parallel reduction that OpenMP does is in general more accurate, as the floating-point addition round-off error gets more distributed. In general, floating-point reductions are problematic because of rounding errors (see http://floating-point-gui.de/); performing those operations in parallel is a way to improve the accuracy by distributing the rounding error. Imagine you are doing a big reduction: at some point the accumulator grows large compared to the other values, and this increases the rounding error of each addition, because the accumulator's range is much larger and it may not be possible to represent the smaller value accurately at that range. If there are multiple accumulators for the same reduction operating in parallel, however, their magnitudes remain smaller and this kind of error is reduced.
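To see this effect in isolation, here is a small single-threaded demonstration (my own, using float so the effect shows up after relatively few terms): one long accumulator loses far more than several smaller accumulators covering the same terms:

#include <cstdio>

int main()
{
    const int N = 100000000;        // 1e8 terms
    const float term = 0.1f;        // not exactly representable in binary

    // One long accumulator: once the sum is large, each added 0.1f suffers
    // a big relative rounding error (and eventually contributes nothing).
    float single_acc = 0.0f;
    for (int i = 0; i < N; i++)
        single_acc += term;

    // Eight accumulators, as a crude stand-in for eight parallel partial
    // sums: each stays about 8x smaller, so each addition rounds less.
    float parts[8] = { 0.0f };
    for (int i = 0; i < N; i++)
        parts[i % 8] += term;

    float multi_acc = 0.0f;
    for (int k = 0; k < 8; k++)
        multi_acc += parts[k];

    printf("double reference = %f\n", 0.1 * N);   // ~1e7
    printf("1 accumulator    = %f\n", single_acc);
    printf("8 accumulators   = %f\n", multi_acc);
}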