Can someone please explain to me why the sequential version of this π-approximation was faster than the parallel one? I can't figure it out.

I'm playing around with a very well-known π-approximation example: pick random points in the unit square ((0, 0) to (1, 1)) and count how many of them fall inside the unit circle. The fraction of hits should approximate π / 4.
```java
import java.util.stream.LongStream;

public class PIEstimation {
    final static int NUM_SAMPLES = 100_000_000;

    public static void main(String[] args) {
        sequentialVersion();
        parallelVersion();
        System.out.println("           Real PI:= " + Math.PI);
    }

    public static void sequentialVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;   // dart landed inside the quarter-circle
            }).count();
        final long duration = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Sequential Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES)
            + " calculated in " + duration + " msecs");
    }

    public static void parallelVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }).count();
        final long duration = (System.nanoTime() - start) / 1_000_000;
        System.out.println("  Parallel Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES)
            + " calculated in " + duration + " msecs");
    }
}
```
The results:

```
Sequential Version: PI ~ 3.14176568 calculated in 4893 msecs
  Parallel Version: PI ~ 3.1417546 calculated in 12044 msecs
           Real PI:= 3.141592653589793
```
I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads).
I suspect the main reason is that `Math.random()` is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for `Math.random()` says the following:

> This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.

To avoid lock contention, use `ThreadLocalRandom` instead. This gives about a 1.8x speedup of the parallel version over the sequential one, not too bad for a two-core machine. Note that the sequential version also gets faster with this change, probably because there's no lock overhead at all.
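A minimal sketch of that change, applied to the parallel version from the question (my illustration of the substitution; the sequential variant is identical minus the `.parallel()` call):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class PIEstimationTLR {
    static final int NUM_SAMPLES = 100_000_000;

    public static void main(String[] args) {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                // current() returns a per-thread generator: no shared lock to contend on
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                double x = rnd.nextDouble();
                double y = rnd.nextDouble();
                return x * x + y * y < 1;
            }).count();
        final long duration = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Parallel (ThreadLocalRandom): PI ~ "
            + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in " + duration + " msecs");
    }
}
```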
Aside: Normally for benchmarks I'd suggest using JMH. However, this benchmark seems to run long enough that it gives a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
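For reference, a minimal JMH sketch of such a benchmark might look as follows (my illustration, not code from the original post; it assumes the `jmh-core` and `jmh-generator-annprocess` artifacts are on the classpath):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;
import org.openjdk.jmh.annotations.Benchmark;

public class PiBenchmark {
    static final int NUM_SAMPLES = 100_000_000;

    @Benchmark
    public long parallelThreadLocalRandom() {
        // Returning the count keeps JMH from dead-code-eliminating the work.
        return LongStream.rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                double x = rnd.nextDouble();
                double y = rnd.nextDouble();
                return x * x + y * y < 1;
            }).count();
    }
}
```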
UPDATE
Here are additional results (requested by user3666197 in comments), using a `NUM_SAMPLES` value of 1_000_000_000 compared to the original 100_000_000. The `sequential` and `parallel` runs are (mostly) the same code as in the question, while `sequential2` and `parallel2` use my modified `ThreadLocalRandom` code. The new timings are overall roughly 10x longer, as one would expect. The longer `parallel2` run isn't quite as fast as one would expect, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.

Prologue:
If the motivation for going parallel was indeed higher performance, cautious design practice requires validating the use of just-enough tools, without adding any unreasonable steps. (The overhead-strict Amdahl's Law formulation shows the cruel penalties of not obeying this practice: parallel code-execution that ends up slower than ordinary serial code-execution. The reason is the add-on overhead cost, not the nature of parallel execution itself. Detailed experimental data confirm this in vivo, re-testable below. Q.E.D.)
Why may a sequential version be faster than a parallel one?

It depends a lot on implementation details.
Platform code-execution timings were observed to carry a noise of about +/- 100 [ms] for NUM_SAMPLES ~ 1E+8 dart-throws. You may Ctrl+C / Ctrl+V / RUN all the listed code-approaches to test different implementations, and to be able to re-test them all with any better alterations added from here. N.b.: Running all the tests makes sense if and only if all of them are run on the very same platform (be it your localhost, your preferred TEST infrastructure, or the "here"; which one does not matter per se, but being the same platform matters if the observed results are to be comparable to each other: apples to apples, not apples to oranges). Tests self-document performance; just run them.

On all parallelisation-ready platforms there is a risk of resource contention if the process depends heavily on a source-of-randomness (we call it twice per dart-shot onto the π-boundary experiment) and such a platform implements this peculiar magic-box hardwired onto some hardware element (because of the cryptographically required pure-[SERIAL], and thus non-parallelisable, nature of such a source-of-randomness). In that case even an infinitely parallelisable process will stand in a pure [SEQ]-queue to get its cryptographically robust next random number. If a platform can generate massively parallel random numbers, it typically instantiates as many "independent" sources-of-randomness as necessary, so as to deliver in-stream cryptographically robust sequences of random numbers, but this requires an approach other than a "shared"-sequence pooling from `Math` via calls to its `.random()` method; one such per-worker design is sketched below.
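For illustration only (this is not from the original answer): Java's `SplittableRandom` follows that many-independent-sources design. Each `split()` yields a statistically independent generator that one worker can own outright, so no lock is ever shared:

```java
import java.util.SplittableRandom;
import java.util.stream.IntStream;

// Each worker gets its own SplittableRandom, split from a common root,
// so no two workers ever touch the same generator.
public class SplittablePI {
    static final int NUM_WORKERS = 4;                  // illustrative value
    static final int DARTS_PER_WORKER = 25_000_000;    // illustrative value

    public static void main(String[] args) {
        SplittableRandom root = new SplittableRandom();
        SplittableRandom[] rngs = new SplittableRandom[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++) rngs[i] = root.split();

        long inside = IntStream.range(0, NUM_WORKERS)
            .parallel()
            .mapToLong(w -> {
                SplittableRandom rnd = rngs[w];
                long hits = 0;
                for (int i = 0; i < DARTS_PER_WORKER; i++) {
                    double x = rnd.nextDouble();
                    double y = rnd.nextDouble();
                    if (x * x + y * y < 1) hits++;
                }
                return hits;
            }).sum();

        long total = (long) NUM_WORKERS * DARTS_PER_WORKER;
        System.out.println("PI ~ " + 4.0 * inside / total);
    }
}
```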
Why was the LongStream version almost twice slower than a naive loop?

First, let me note that using any high-level abstraction is the first risk of losing control over efficient parallelisation. Yes, the `LongStream` methods are pretty attractive for implementing some scheme in a few SLOCs, but here a dumb-level code setup shows that you systematically lose about ~ 1110 - 1200 [ms] to the stream-based processing, compared with a naive loop.
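A minimal sketch of such a naive loop (my illustration, using `ThreadLocalRandom` rather than the locked `Math.random()`):

```java
import java.util.concurrent.ThreadLocalRandom;

// A plain counted loop, no stream machinery at all.
public class NaivePiLoop {
    static final int NUM_SAMPLES = 100_000_000;

    public static void main(String[] args) {
        final long start = System.nanoTime();
        long count = 0;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < NUM_SAMPLES; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y < 1) count++;   // dart landed inside the quarter-circle
        }
        final long duration = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Naive loop: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES)
            + " calculated in " + duration + " msecs");
    }
}
```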
What must one always know before making a [SEQ] | [PAR] decision?

Such a decision is neither a toss of a coin, nor an easy and straightforward step. We hear many times about the immense powers of parallel computing, but Mother Nature does not let us gain without pain.
Gene AMDAHL formulated a law which demonstrates the speedups possible for a process if more processors are harnessed. The story is not trivial even in the simplified Amdahl's Law formulation (details will be put later).
Given a process P, we may assume that:

for every P there is always a part scheduled and executed in a [SERIAL] way (let's call it the [SEQ] part), and similarly

for every P there may be one or more parts of P that may get executed in a [PARALLEL] fashion (as resources and mutual (mis)coordination permit).

Amdahl put such pure-[SERIAL] process-execution into comparison with a possibly [SEQ]+[PAR] process-execution, using the former part as-is and gaining the possible speedup just from the latter part, by doing a split among N processors. With p the fraction of the process that can run in [PARALLEL] (and ( 1 - p ) thus the [SEQ] fraction), the speedup S is:

S = 1 / ( ( 1 - p ) + p / N )

Even this simplified formulation immediately warns that the more time the [SEQ] part takes, the less speedup the [PAR] part can deliver by being split onto any amount of N > 1 processors.
So the first output for one's decision-making is: isolate and benchmark the [SEQ] part and the [PAR] part, so as to get a first approximation of what levels of speedup are principally attainable (even for an infinite amount of processors); a probe of that kind is sketched below.
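A minimal sketch of such an isolate-and-benchmark probe (entirely my illustration; `seqPhase` / `parPhaseCandidate` are hypothetical stand-ins for the two parts of one's own process):

```java
// Time the [SEQ] and [PAR]-candidate phases separately, then estimate
// the serial fraction s and the ceiling on any possible speedup (1 / s).
public class FractionProbe {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        seqPhase();
        long t1 = System.nanoTime();
        parPhaseCandidate();
        long t2 = System.nanoTime();

        double s = (double) (t1 - t0) / (t2 - t0);   // [SEQ] fraction of total
        System.out.printf("s = %.3f, p = %.3f, speedup ceiling = %.2fx%n",
                          s, 1.0 - s, 1.0 / s);
    }

    static long seqPhase() {                 // stand-in serial work
        long acc = 0;
        for (int i = 0; i < 10_000_000; i++) acc += i;
        return acc;
    }

    static long parPhaseCandidate() {        // stand-in parallelisable work
        long acc = 0;
        for (int i = 0; i < 90_000_000; i++) acc += i;
        return acc;
    }
}
```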
Next comes the hard part of the truth: there is no such thing as a free dinner. Mother Nature steps in; there is no gain without pain.
One can always spend some reasonable amount of science on re-factoring the [SEQ] part, which yields immediate effects on the overall process-duration. The theory of [?TIME][?SPACE]-complexity (? == { c-onst | p-olynomial | exp-onential }) principally introduces the potential costs and savings achievable in these efforts. Benchmarks, meaning REALISTIC benchmarks, will tell the difference. What works nice and smooth at in-cache sizes will choke and progress much slower at beyond-cache-size scales, etc. So always design REALISTIC benchmarks at scales that remain relevant for the target process-scaling; otherwise poor decisions come from the table of such a scaling-naive experimenter.

Next, one can harness some parallel-computing tools, so as to get the [PAR] part executed faster on some available subset of processors.
Given the [SEQ] part was squashed close to the silicon limits, and the [PAR] part duration remains a reasonable portion of the whole original process-duration, the [PAR] tools may indeed help -- BUT AT A COST. Yes, one always has to pay some cost for "arranging" all the toys so as to start executing in [PAR]. Similarly, at the end of the [PAR]-scheduled code-execution, there is another step: somehow collect the results, deactivate / terminate the individual [PAR] executions, and let those results be delivered back into the [SEQ] part of the code, where the (hopefully) "accelerated" results get further used.

All this is called overhead. The overhead, let's call it o, remains hidden from one's eyes in the simplified (overhead-naive) Amdahl's Law formulation, but real-world execution will always demonstrate this add-on part in benchmarking, as the code cannot "skip" it; there is no magic instruction for a direct "jump"-from-[SEQ]-into-[PAR] or a "jump"-from-[PAR]-back-to-[SEQ]. Such things simply do not exist...
Overhead-strict formulation of the Amdahl's Law speedup S:

S = 1 / ( ( 1 - p ) + pSO + p / N + pTO )

where pSO is the add-on overhead cost of the [PAR] setup and pTO the add-on overhead cost of the [PAR] termination and results collection.

Reading both formulations, one need not be any mathematical genius to realise that it is quite enough to have to pay expensive costs of ( pSO + pTO ) for the [PARALLEL] version to get slower, not faster, than a pure [SERIAL] version of a process that was naively expected to get theoretically accelerated, even if an infinite amount of processors is harnessed.

Q.E.D.
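A quick numeric illustration of the two formulations (all sample values below are my own, chosen purely for demonstration):

```java
// Overhead-naive vs. overhead-strict Amdahl's Law, side by side.
public class AmdahlOverheadDemo {
    // S = 1 / ( (1 - p) + pSO + p / N + pTO ); pSO = pTO = 0 gives the naive form.
    static double speedup(double p, int n, double pSO, double pTO) {
        return 1.0 / ((1.0 - p) + pSO + p / n + pTO);
    }

    public static void main(String[] args) {
        // 95% [PAR] fraction on 4 processors, zero overhead: ~3.48x
        System.out.println(speedup(0.95, 4, 0.00, 0.00));
        // Same split, but 10% setup + 10% termination overhead: ~2.05x
        System.out.println(speedup(0.95, 4, 0.10, 0.10));
        // Heavy overheads push S below 1.0: the [PAR] version is SLOWER
        System.out.println(speedup(0.95, 4, 0.60, 0.60));
    }
}
```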
In the Epilogue section of this, there is an easy-to-experiment, interactive + animated toy-tool with sliders for both o and p == ( 1 - s ), to evaluate the effects of the overhead-strict Amdahl's Law for one's individual [SEQ]-[PAR] process compositions. Feel free to click through and experiment on your own.

This will help you test when the principal [PAR] overheads start to get at least offset by an N-CPU [PARALLEL] execution, and when any such efforts remain wasted in just a principally worse performance than a [SERIAL] process execution, which is always, in this sense, an overhead-free and cheap competitor.