Can someone please explain to me why the sequential version of the π-approximation was faster than the parallel one?
I can't figure it out.
I'm playing around with a very well-known π-approximation example: I pick random points in the unit square ( ( 0, 0 ) to ( 1, 1 ) ) and count how many of those random points fall inside the unit circle. The fraction of hits should approximate π / 4.
import java.util.stream.LongStream;

public class PIEstimation {

    final static int NUM_SAMPLES = 100000000;

    public static void main(String[] args) {
        sequentialVersion();
        parallelVersion();
        System.out.println("   Real PI:= " + Math.PI);
    }

    public static void sequentialVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;   // dart landed inside the unit circle
            }).count();
        final long duration = ((System.nanoTime() - start) / 1_000_000);
        System.out.println("Sequential Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }

    public static void parallelVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;   // dart landed inside the unit circle
            }).count();
        final long duration = ((System.nanoTime() - start) / 1_000_000);
        System.out.println("  Parallel Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }
}
The results:
Sequential Version: PI ~ 3.14176568 calculated in 4893 msecs
Parallel Version: PI ~ 3.1417546 calculated in 12044 msecs
Real PI:= 3.141592653589793
I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads):
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
I suspect the main reason is that Math.random() is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for Math.random() says the following:

This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.

To avoid lock contention, use ThreadLocalRandom instead:
long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                       .parallel()
                       .filter(e -> {
                           ThreadLocalRandom cur = ThreadLocalRandom.current();
                           double x = cur.nextDouble();
                           double y = cur.nextDouble();
                           return x * x + y * y < 1;
                       })
                       .count();
This gives the following results:
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
which is a 1.8x speedup, not too bad for a two-core machine. Note that this is also faster when run sequentially, probably because there's no lock overhead at all.
Aside: Normally for benchmarks I'd suggest using JMH. However, this benchmark seems to run long enough that it gives a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
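For completeness, a minimal JMH harness for the parallel ThreadLocalRandom variant could look roughly like the sketch below. This is only a sketch: the class and method names are mine, and it assumes the usual jmh-core and annotation-processor dependencies are on the classpath.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)           // report average time per invocation
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PiJmhBench {                  // hypothetical class name

    @Param({"100000000"})                  // NUM_SAMPLES as a benchmark parameter
    public long numSamples;

    @Benchmark
    public long parallelThreadLocalRandom() {
        return LongStream.rangeClosed(1, numSamples)
                         .parallel()
                         .filter(e -> {
                             ThreadLocalRandom cur = ThreadLocalRandom.current();
                             double x = cur.nextDouble();
                             double y = cur.nextDouble();
                             return x * x + y * y < 1;
                         })
                         .count();
    }
}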
UPDATE

Here are additional results (requested by user3666197 in comments), using a NUM_SAMPLES value of 1_000_000_000 compared to the original 100_000_000. I've copied the results from above for easy comparison.
NUM_SAMPLES = 100_000_000
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
NUM_SAMPLES = 1_000_000_000
sequential: PI ~ 3.141572896 calculated in 47730 msecs
parallel: PI ~ 3.141543836 calculated in 228969 msecs
sequential2: PI ~ 3.1414865 calculated in 12843 msecs
parallel2: PI ~ 3.141635704 calculated in 7953 msecs
The sequential and parallel results are (mostly) the same code as in the question, and sequential2 and parallel2 are using my modified ThreadLocalRandom code. The new timings are overall roughly 10x longer, as one would expect. The longer parallel2 run isn't quite as fast as one would expect, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.
Prologue:
If the motivation for going parallel was indeed higher performance, a cautious design practice should validate the use of just-enough tools, without adding any unreasonable steps
( the overhead-strict Amdahl's Law formulation shows the cruel penalties for not obeying this design practice -- resulting in slower parallel code-execution compared to ordinary serial code-execution. The reason is the very add-on overhead cost, not the nature of the parallel execution. Detailed experimental data confirm this in vivo ( re-testable below ). Q.E.D. )
PAR: sTLR PI ~ 3.141674092 calculated in 14386[ms] | log10 DIFF ~ -4.0891707129672366
PAR: sTLR PI ~ 3.141483408 calculated in 14007[ms] | log10 DIFF ~ -3.9615960863226483
PAR: sTLR>PI ~ 3.141480016 calculated in 14249[ms] | log10 DIFF ~ -3.948316651088744
PAR: sTLR>PI ~ 3.141588188 calculated in 13967[ms] | log10 DIFF ~ -5.350121173515822
| || |
LongStream.rangeClose()------------+ || |
ThreadLocalRandom src-of-randomness---+| |
{ _: original | >: mirrored }-counting-+ |
|
PAR3/2wTLR PI ~ 3.141519024 calculated in 7394[ms] | log10 DIFF ~ -4.132947619067715
PAR3/2wTLR PI ~ 3.141737748 calculated in 7741[ms] | log10 DIFF ~ -3.8383493185270883
PAR3/2wTLR>PI ~ 3.141537576 calculated in 6751[ms] | log10 DIFF ~ -4.491034048285749
PAR3/2wTLR>PI ~ 3.14154848 calculated in 7272[ms] | log10 DIFF ~ -4.35483730610719
|| || |
Amdahl-Law-fair-2-blks/2-threads--+| || |
while(){...}-----------------------+ || |
ThreadLocalRandom src-of-randomness---+| |
{ _: original | >: mirrored }-counting-+ |
Why may a sequential version get faster than a parallel one?
It depends a lot on the implementation details.
The platform's code-execution timings were observed to carry noise of about +/- 100 [ms] for NUM_SAMPLES ~ 1E+8 dart-throws. You may Ctrl+C / Ctrl+V / RUN all the code-approaches listed here to test different implementations and to re-test them all, with the chance of adding any better alterations of your own. N.b.: Running all the tests makes sense if and only if all of them are run on the very same platform ( be it your localhost, your preferred TEST-infrastructure, or the "here" -- which platform does not matter per se, but keeping it the same does matter if the observed results are to be comparable to each other, apples to apples rather than apples to oranges ). Tests self-document performance, just run 'em.
On all parallelisation-ready platforms there is a risk of resource contention if the process depends heavily on a source-of-randomness ( we call it twice per dart-throw in the PI-boundary experiment ) and the platform implements this peculiar magic box ( owing to the cryptographically required pure-[SERIAL], and thus non-parallelisable, nature of such a source-of-randomness ) hardwired onto some hardware element. In that case even an infinitely parallelisable process will stand in a pure [SEQ]-queue to obtain the next cryptographically-robust random number. If a platform can generate massively-parallel random numbers, it typically instantiates as many "independent" sources-of-randomness as necessary, so as to produce cryptographically-robust in-stream sequences of random numbers, but this requires a different approach than "shared"-sequence pooling from Math via calls to its .random() method. A sketch of that per-worker-generator idea follows.
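As an illustration only ( this sketch is mine, not part of the original answer ), one way to give each worker its own, independently seeded generator is java.util.SplittableRandom, splitting the generators up-front in a single thread before going parallel. SplittableRandom is not cryptographically strong, but it shows the per-worker-generator pattern:

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.IntStream;

// hypothetical sketch: one "independent" source-of-randomness per worker
final int  WORKERS    = 4;
final long PER_WORKER = 100_000_000L / WORKERS;

SplittableRandom root = new SplittableRandom();
List<SplittableRandom> perWorker = new ArrayList<>();
for (int w = 0; w < WORKERS; w++)
    perWorker.add(root.split());            // split in one thread, before going parallel

long hits = IntStream.range(0, WORKERS)
                     .parallel()
                     .mapToLong(w -> {
                         SplittableRandom rnd = perWorker.get(w);
                         long count = 0;
                         for (long i = 0; i < PER_WORKER; i++) {
                             double x = rnd.nextDouble();
                             double y = rnd.nextDouble();
                             if (x * x + y * y < 1) count++;
                         }
                         return count;
                     })
                     .sum();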
Why was the LongStream version almost twice as slow as a naive loop?
First, let me note that using any high-level abstraction is the first risk of losing control over efficient parallelisation. Yes, the LongStream methods are pretty attractive for implementing a scheme in a few SLOCs, but here the dumb-level code setup shows that you systematically lose about ~ 1110 - 1200 [ms] to stream-based processing:
----< extended tests >----
( for just 1E+9 samples, on the platform cited above, a LongStream with .parallel().filter().count() plus ThreadLocalRandom runs about twice as slowly as even a pure [SEQ] version running a trivial naive while(){...} loop, or as a split-task .parallel() with distributed summations and just a top-level .map( while(){...} ).sum() -- a sketch of that split-block approach appears after the test listings below )
1E+9
SEQ wTLR PI ~ 3.141553264 calculated in 7386[ms] | log10 DIFF ~ -4.404618541952148
SEQ wTLR PI ~ 3.1414793 calculated in 7578[ms] | log10 DIFF ~ -3.9455647216537932
SEQ wTLR PI ~ 3.141631852 calculated in 6913[ms] | log10 DIFF ~ -4.40673154655876
PAR: sTLR PI ~ 3.141556252 calculated in 14293[ms] | log10 DIFF ~ -4.438879648677778
PAR: sTLR PI ~ 3.141610744 calculated in 13919[ms] | log10 DIFF ~ -4.742551585240871
| | |
LongStream.rangeClose()------------+ | |
ThreadLocalRandom src-of-randomness---+ |
|
PAR3/2wTLR PI ~ 3.141519396 calculated in 6944[ms] | log10 DIFF ~ -4.135147373916102
|| | |
Amdahl-Law-fair-2-blks/2-threads--+| | |
while(){...}-----------------------+ | |
ThreadLocalRandom src-of-randomness---+ |
|
PAR3/4wTLR PI ~ 3.141657656 calculated in 7489[ms] | log10 DIFF ~ -4.187070539970178
|| | vs 14xxx
Amdahl-Law-fair-4-blks/2-threads--+| | ~~~~~~~~~~~~~ /4+ threads???
while(){...}-----------------------+ |
ThreadLocalRandom src-of-randomness---+
1E+9
SEQ wTLR PI ~ 3.141670832 calculated in 7142[ms] | log10 DIFF ~ -4.106913165387672
SEQ wTLR PI ~ 3.141631852 calculated in 6913[ms] | log10 DIFF ~ -4.40673154655876
PAR3/4wTLR PI ~ 3.141499044 calculated in 7127[ms] | log10 DIFF ~ -4.028679657876861
PAR3/4wTLR PI ~ 3.141521608 calculated in 7420[ms] | log10 DIFF ~ -4.148462876047807
PAR3/2wTLR PI ~ 3.141519396 calculated in 6944[ms] | log10 DIFF ~ -4.135147373916102
PAR3/2wTLR PI ~ 3.141602576 calculated in 6954[ms] | log10 DIFF ~ -5.0033828225658254
Real PI = 3.141592653589793
SEQ wTLR PI ~ 3.141622264 calculated in 6967[ms] | log10 DIFF ~ -4.52855557608415
SEQ wTLR PI ~ 3.141507096 calculated in 6894[ms] | log10 DIFF ~ -4.067741458256534
PAR3/4wTLR PI ~ 3.141655584 calculated in 7106[ms] | log10 DIFF ~ -4.2011394373291715
PAR3/4wTLR PI ~ 3.141564284 calculated in 7386[ms] | log10 DIFF ~ -4.547146943793993
PAR3/2wTLR PI ~ 3.141547824 calculated in 6919[ms] | log10 DIFF ~ -4.348435235065685
PAR3/2wTLR PI ~ 3.141512504 calculated in 6959[ms] | log10 DIFF ~ -4.096098696030472
Real PI = 3.141592653589793
-----< initial tests >-----
stream SEQ: PI ~ 3.14148516 calculated in 2946 msecs
naive SEQ2 PI ~ 3.14149424 calculated in 1830 msecs
stream PAR: PI ~ 3.14171496 calculated in 2853 msecs
stream PAR2 PI ~ 3.1414986 calculated in 3017 msecs
naive PAR3 PI ~ 3.14157392 calculated in 1842 msecs
Real PI = 3.141592653589793
stream SEQ: PI ~ 3.1416566 calculated in 2977 msecs
naive SEQ2 PI ~ 3.14167044 calculated in 1868 msecs
stream PAR: PI ~ 3.1415044 calculated in 2617 msecs
stream PAR2 PI ~ 3.14159492 calculated in 3022 msecs
naive PAR3 PI ~ 3.14154072 calculated in 1811 msecs
Real PI = 3.141592653589793
-----< extended tests >-----
SEQ: MrndPI ~ 3.14184796 calculated in 3003[ms] | log10 DIFF ~ -3.5929382808376515
SEQ: MrndPI ~ 3.14152884 calculated in 2769[ms] | log10 DIFF ~ -4.195086823728298
SEQ2 wMrndPI ~ 3.14192708 calculated in 1895[ms] | log10 DIFF ~ -3.475699432925146
SEQ2 wMrndPI ~ 3.14171308 calculated in 1846[ms] | log10 DIFF ~ -3.9192782593456794
SEQ wTLR PI ~ 3.14165348 calculated in 739[ms] | log10 DIFF ~ -4.215907813544407
SEQ wTLR PI ~ 3.14136852 calculated in 749[ms] | log10 DIFF ~ -3.649493053020102
PAR: sMrndPI ~ 3.14179224 calculated in 2857[ms] | log10 DIFF ~ -3.699869033054068
PAR: sMrndPI ~ 3.141472 calculated in 2652[ms] | log10 DIFF ~ -3.9184597520458992
PAR2 sMrndPI ~ 3.14140116 calculated in 2940[ms] | log10 DIFF ~ -3.717845759366814
PAR2 sMrndPI ~ 3.14157204 calculated in 2930[ms] | log10 DIFF ~ -4.685846370584897
PAR3/1sTLR PI ~ 3.14143636 calculated in 2044[ms] | log10 DIFF ~ -3.8060588337186982
PAR3/1sTLR PI ~ 3.14131304 calculated in 2042[ms] | log10 DIFF ~ -3.5534417248121435
PAR3/4wMrndPI ~ 3.14142288 calculated in 1832[ms] | log10 DIFF ~ -3.770129868268526
PAR3/4wMrndPI ~ 3.1417362 calculated in 1785[ms] | log10 DIFF ~ -3.843007663821216
PAR3/4wTLR PI ~ 3.14152796 calculated in 746[ms] | log10 DIFF ~ -4.189138749553116
PAR3/4wTLR PI ~ 3.14163412 calculated in 748[ms] | log10 DIFF ~ -4.382303560363832
PAR3/2wTLR PI ~ 3.14163616 calculated in 757[ms] | log10 DIFF ~ -4.361446749657272
PAR3/2wTLR PI ~ 3.14137776 calculated in 732[ms] | log10 DIFF ~ -3.6677765391807546
Real PI = 3.141592653589793
1E+8
SEQ wTLR PI ~ 3.14173396 calculated in 710[ms] | log10 DIFF ~ -3.8498381364228056
SEQ wTLR PI ~ 3.14174612 calculated in 714[ms] | log10 DIFF ~ -3.813986665516408
PAR3/4wMrndPI ~ 3.14170104 calculated in 2022[ms] | log10 DIFF ~ -3.9650251674482195
PAR3/4wMrndPI ~ 3.1418928 calculated in 1725[ms] | log10 DIFF ~ -3.522666846499994
PAR3/4wTLR PI ~ 3.14170568 calculated in 734[ms] | log10 DIFF ~ -3.9468200656590717
PAR3/4wTLR PI ~ 3.14148048 calculated in 749[ms] | log10 DIFF ~ -3.9501093815571333
PAR3/2wTLR PI ~ 3.14175252 calculated in 709[ms] | log10 DIFF ~ -3.7962427769936427
PAR3/2wTLR PI ~ 3.14175248 calculated in 741[ms] | log10 DIFF ~ -3.796351454937919
Real PI = 3.141592653589793
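For reference, a minimal sketch of the split-task, naive while(){...} approach labelled PAR3/2wTLR above could look like the following ( an illustrative reconstruction from the description, not the original benchmark code; the method name is made up ):

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// hypothetical reconstruction: split NUM_SAMPLES into a few blocks, run a tight
// while(){...} loop per block in parallel, and sum the per-block hit counts
static double estimatePiInBlocks( long numSamples, int blocks ) {
    final long perBlock = numSamples / blocks;
    final long hits = IntStream.range( 0, blocks )
                               .parallel()
                               .mapToLong( b -> {
                                   ThreadLocalRandom rnd = ThreadLocalRandom.current();
                                   long count = 0;
                                   long i = perBlock;
                                   while ( i-- > 0 ) {
                                       double x = rnd.nextDouble();
                                       double y = rnd.nextDouble();
                                       if ( x * x + y * y < 1 ) count++;
                                   }
                                   return count;
                               } )
                               .sum();
    return 4.0 * hits / ( perBlock * blocks );
}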
What must one always know before making a [SEQ] | [PAR] decision?
Such a decision is neither a toss of a coin, nor an easy & straightforward step. We hear many times about the immense powers of parallel computing, but Mother Nature does not let us gain without pain.
Gene AMDAHL formulated a law which demonstrates the speedups possible for a process if more processors get harnessed.
The story is not trivial, even in the simplified Amdahl's Law formulation ( details will come later ).
Given a process-P, we may assume that
for every P there is always a part scheduled and executed in a [SERIAL]-way ( let's call it the [SEQ]-part ), and similarly
for every P there may be one or more parts of P that may get executed in a [PARALLEL]-fashion ( as resources and mutual-(mis)-coordination permit ).
Amdahl compared such pure-[SERIAL] process-execution with a possible [SEQ]+[PAR] process-execution, using the former part as-is and gaining the possible speedup only from the latter part by splitting it among N processors; the speedup S is:
              1
S = _____________________ ; where
         ( 1 - s )               s      := duration the [SEQ]-part took
     s + _________             ( 1-s )  := duration the [PAR]-part took before the split
             N                   N      := number of [PAR]-processors to split the task onto
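As a quick illustrative check ( this helper is mine, not part of the original answer ), the simplified formula can be coded and evaluated directly:

// simplified ( overhead-naive ) Amdahl's Law speedup -- illustrative helper only
static double amdahlSpeedup( double s, int N ) {
    return 1.0 / ( s + ( 1.0 - s ) / N );
}
// e.g. amdahlSpeedup( 0.25, 4 ) ~ 2.29, and no number of processors
//      can ever push the speedup above 1 / s = 4.0 for s = 0.25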
Even this simplified formulation immediately warns that:
- the more time the [SEQ]-part takes, the less speedup the [PAR]-part may deliver by being split onto any number of processors N > 1
So the first output of one's decision-making is: isolate and benchmark the [SEQ]-part and the [PAR]-part, so as to get a first approximation of what levels of speedup are principally attainable ( even with an infinite number of processors ).
Next comes the hard part of the truth:
There is no such thing as a free lunch. Mother Nature steps in; there is no gain without pain.
One can always spend a reasonable amount of science on re-factoring the [SEQ]-part, which yields immediate effects on the overall process-duration. The theory of [?TIME][?SPACE]-complexity ( ? == { c-onst | p-olynomial | exp-onential } ) frames the potential costs and savings achievable in these efforts. Benchmarks, meaning REALISTIC benchmarks, will tell the difference. What works nice and smooth at in-cache sizes will choke and progress much more slowly at beyond-cache-size scales, etc. So always design REALISTIC benchmarks at scales that remain relevant for the target process-scaling; otherwise poor decisions come out of such a scaling-naive experimenter's table.
Next, one can harness some parallel-computing tools so as to get the [PAR]-part executed faster on some available sub-set of processors.
Given that the [SEQ]-part has been squashed close to the silicon limits, and that the [PAR]-part duration remains a reasonable portion of the whole original process-duration, the [PAR]-tools may indeed help --- BUT AT A COST --- yes, one always has to pay some cost of "arranging" all the toys so as to start executing in [PAR]. Similarly, at the end of the [PAR]-scheduled code-execution there is another step: somehow collect the results, deactivate / terminate the individual [PAR]-executions, and deliver these results back into the [SEQ]-part of the code, where the ( hopefully ) "accelerated" results get further used.
All this is called overhead. The overhead, let's call it o, remains hidden from one's eyes in the simplified ( overhead-naive ) Amdahl's Law formulation, but real-world execution will always demonstrate this add-on part in benchmarking, as the code cannot "skip" it by some magic instruction for a direct "jump"-from-[SEQ]-into-[PAR] and "jump"-from-[PAR]-back-to-[SEQ]; such things simply do not exist...
Overhead-strict formulation of the Amdahl's Law speedup S:

                   1
S = ________________________________ ; where s, ( 1 - s ), N were defined above
                 ( 1 - s )                    pSO := [PAR]-Setup-Overhead     add-on
     s + pSO + _____________ + pTO            pTO := [PAR]-Terminate-Overhead add-on
                     N
Reading both formulations, one need not be a mathematical genius to realise that it is quite enough to have to pay a sufficiently expensive ( pSO + pTO ) for the [PARALLEL]-version to become slower, not faster, than the pure [SERIAL]-version of a process that was naively expected to get theoretically accelerated, even if an infinite number of processors were harnessed.
Q.E.D.
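Again purely for illustration ( my helper, not part of the original answer ), plugging numbers into the overhead-strict formula shows how the add-on costs alone can push the speedup below 1.0:

// overhead-strict Amdahl's Law speedup -- illustrative helper only
static double amdahlSpeedupWithOverheads( double s, double pSO, double pTO, int N ) {
    return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO );
}
// e.g. with s = 0.05 and ( pSO + pTO ) = 0.60 on N = 2 processors:
//      amdahlSpeedupWithOverheads( 0.05, 0.30, 0.30, 2 ) ~ 0.89  ->  slower than pure [SERIAL]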
In the Epilogue section of this, there is an easy-to-experiment, interactive + animated toy-tool with sliders for both o and p == ( 1 - s ), to evaluate the effects of the overhead-strict Amdahl's Law for one's individual [SEQ]-[PAR] process compositions. Feel free to click through and experiment on your own.
This will help you see when the principal [PAR]-overheads start to get at least compensated by N-CPU-[PARALLEL]-execution, and when any such efforts remain wasted in principally worse performance than a [SERIAL]-process execution, which is always, in this sense, an overhead-free and cheap competitor.