Can someone please explain to me why the sequential version of the π-approximation was faster than the parallel one?
I can't figure it out.
I'm playing around with a very well-known π-approximation example: I pick random points in the unit square ( ( 0, 0 ) to ( 1, 1 ) ) and count how many of those random points fall inside the unit circle. The fraction of hits should approximate π / 4.
import java.util.stream.LongStream;

public class PIEstimation {

    final static int NUM_SAMPLES = 100000000;

    public static void main(String[] args) {
        sequentialVersion();
        parallelVersion();
        System.out.println("   Real PI:= " + Math.PI);
    }

    public static void sequentialVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;   // dart landed inside the unit circle
            }).count();
        final long duration = ((System.nanoTime() - start) / 1_000_000);
        System.out.println("Sequential Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }

    public static void parallelVersion() {
        final long start = System.nanoTime();
        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;   // dart landed inside the unit circle
            }).count();
        final long duration = ((System.nanoTime() - start) / 1_000_000);
        System.out.println("  Parallel Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }
}
The results:
Sequential Version: PI ~ 3.14176568 calculated in 4893 msecs
Parallel Version: PI ~ 3.1417546 calculated in 12044 msecs
Real PI:= 3.141592653589793
I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads):
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
I suspect the main reason is that Math.random() is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for Math.random() says the following:

This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.

To avoid lock contention, use ThreadLocalRandom instead:
long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                       .parallel()
                       .filter(e -> {
                           ThreadLocalRandom cur = ThreadLocalRandom.current();
                           double x = cur.nextDouble();
                           double y = cur.nextDouble();
                           return x * x + y * y < 1;
                       })
                       .count();
This gives the following results:
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
which is a 1.8x speedup, not too bad for a two-core machine. Note that this is also faster when run sequentially, probably because there's no lock overhead at all.
Aside: Normally for benchmarks I'd suggest using JMH. However, this benchmark seems to run long enough that it gives a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
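For completeness, a minimal JMH harness for the parallel ThreadLocalRandom variant could look roughly like the sketch below. This is only a sketch: the class and method names are mine, and it assumes the usual jmh-core and annotation-processor dependencies are on the classpath.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)           // report average time per invocation
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PiJmhBench {                  // hypothetical class name

    @Param({"100000000"})                  // NUM_SAMPLES as a benchmark parameter
    public long numSamples;

    @Benchmark
    public long parallelThreadLocalRandom() {
        return LongStream.rangeClosed(1, numSamples)
                         .parallel()
                         .filter(e -> {
                             ThreadLocalRandom cur = ThreadLocalRandom.current();
                             double x = cur.nextDouble();
                             double y = cur.nextDouble();
                             return x * x + y * y < 1;
                         })
                         .count();
    }
}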
UPDATE

Here are additional results (requested by user3666197 in comments), using a NUM_SAMPLES value of 1_000_000_000 compared to the original 100_000_000. I've copied the results from above for easy comparison.
NUM_SAMPLES = 100_000_000
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
NUM_SAMPLES = 1_000_000_000
sequential: PI ~ 3.141572896 calculated in 47730 msecs
parallel: PI ~ 3.141543836 calculated in 228969 msecs
sequential2: PI ~ 3.1414865 calculated in 12843 msecs
parallel2: PI ~ 3.141635704 calculated in 7953 msecs
The sequential and parallel results are (mostly) the same code as in the question, and sequential2 and parallel2 are using my modified ThreadLocalRandom code. The new timings are overall roughly 10x longer, as one would expect. The longer parallel2 run isn't quite as fast as one would expect, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.
Prologue:
If the motivation for going parallel was indeed higher performance, a cautious design practice should validate the use of just-enough tools, without adding any unreasonable steps
( the overhead-strict Amdahl's Law formulation shows the cruel penalties for not obeying this design practice -- resulting in slower parallel code-execution compared to ordinary serial code-execution. The reason is the very add-on overhead cost, not the nature of the parallel execution. Detailed experimental data confirm this in vivo ( re-testable below ). Q.E.D. )
PAR: sTLR PI ~ 3.141674092 calculated in 14386[ms] | log10 DIFF ~ -4.0891707129672366
PAR: sTLR PI ~ 3.141483408 calculated in 14007[ms] | log10 DIFF ~ -3.9615960863226483
PAR: sTLR>PI ~ 3.141480016 calculated in 14249[ms] | log10 DIFF ~ -3.948316651088744
PAR: sTLR>PI ~ 3.141588188 calculated in 13967[ms] | log10 DIFF ~ -5.350121173515822
| || |
LongStream.rangeClose()------------+ || |
ThreadLocalRandom src-of-randomness---+| |
{ _: original | >: mirrored }-counting-+ |
|
PAR3/2wTLR PI ~ 3.141519024 calculated in 7394[ms] | log10 DIFF ~ -4.132947619067715
PAR3/2wTLR PI ~ 3.141737748 calculated in 7741[ms] | log10 DIFF ~ -3.8383493185270883
PAR3/2wTLR>PI ~ 3.141537576 calculated in 6751[ms] | log10 DIFF ~ -4.491034048285749
PAR3/2wTLR>PI ~ 3.14154848 calculated in 7272[ms] | log10 DIFF ~ -4.35483730610719
|| || |
Amdahl-Law-fair-2-blks/2-threads--+| || |
while(){...}-----------------------+ || |
ThreadLocalRandom src-of-randomness---+| |
{ _: original | >: mirrored }-counting-+ |
Why may a sequential version get faster than a parallel one?
It depends a lot on the implementation details.
The platform's code-execution timings were observed to carry noise of about +/- 100 [ms] for NUM_SAMPLES ~ 1E+8 dart-throws. You may Ctrl+C / Ctrl+V / RUN all the code-approaches listed here to test different implementations and to re-test them all, with the chance of adding any better alterations of your own. N.b.: Running all the tests makes sense if and only if all of them are run on the very same platform ( be it your localhost, your preferred TEST-infrastructure, or the "here" -- which platform does not matter per se, but keeping it the same does matter if the observed results are to be comparable to each other, apples to apples rather than apples to oranges ). Tests self-document performance, just run 'em.
On all parallelisation-ready platforms there is a risk of resource contention if the process depends heavily on a source-of-randomness ( we call it twice per dart-throw in the PI-boundary experiment ) and the platform implements this peculiar magic box ( owing to the cryptographically required pure-[SERIAL], and thus non-parallelisable, nature of such a source-of-randomness ) hardwired onto some hardware element. In that case even an infinitely parallelisable process will stand in a pure [SEQ]-queue to obtain the next cryptographically-robust random number. If a platform can generate massively-parallel random numbers, it typically instantiates as many "independent" sources-of-randomness as necessary, so as to produce cryptographically-robust in-stream sequences of random numbers, but this requires a different approach than "shared"-sequence pooling from Math via calls to its .random() method. A sketch of that per-worker-generator idea follows.
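As an illustration only ( this sketch is mine, not part of the original answer ), one way to give each worker its own, independently seeded generator is java.util.SplittableRandom, splitting the generators up-front in a single thread before going parallel. SplittableRandom is not cryptographically strong, but it shows the per-worker-generator pattern:

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.IntStream;

// hypothetical sketch: one "independent" source-of-randomness per worker
final int  WORKERS    = 4;
final long PER_WORKER = 100_000_000L / WORKERS;

SplittableRandom root = new SplittableRandom();
List<SplittableRandom> perWorker = new ArrayList<>();
for (int w = 0; w < WORKERS; w++)
    perWorker.add(root.split());            // split in one thread, before going parallel

long hits = IntStream.range(0, WORKERS)
                     .parallel()
                     .mapToLong(w -> {
                         SplittableRandom rnd = perWorker.get(w);
                         long count = 0;
                         for (long i = 0; i < PER_WORKER; i++) {
                             double x = rnd.nextDouble();
                             double y = rnd.nextDouble();
                             if (x * x + y * y < 1) count++;
                         }
                         return count;
                     })
                     .sum();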
Why was the LongStream version almost twice as slow as a naive loop?
First, let me note that using any high-level abstraction is the first risk of losing control over efficient parallelisation. Yes, the LongStream methods are pretty attractive for implementing a scheme in a few SLOCs, but here the dumb-level code setup shows that you systematically lose about ~ 1110 - 1200 [ms] to stream-based processing:
----< extended tests >----
( for just 1E+9 samples, on the platform cited above, a LongStream with .parallel().filter().count() plus ThreadLocalRandom runs about twice as slowly as even a pure [SEQ] version running a trivial naive while(){...} loop, or as a split-task .parallel() with distributed summations and just a top-level .map( while(){...} ).sum() -- a sketch of that split-block approach appears after the test listings below )
1E+9
SEQ wTLR PI ~ 3.141553264 calculated in 7386[ms] | log10 DIFF ~ -4.404618541952148
SEQ wTLR PI ~ 3.1414793 calculated in 7578[ms] | log10 DIFF ~ -3.9455647216537932
SEQ wTLR PI ~ 3.141631852 calculated in 6913[ms] | log10 DIFF ~ -4.40673154655876
PAR: sTLR PI ~ 3.141556252 calculated in 14293[ms] | log10 DIFF ~ -4.438879648677778
PAR: sTLR PI ~ 3.141610744 calculated in 13919[ms] | log10 DIFF ~ -4.742551585240871
| | |
LongStream.rangeClose()------------+ | |
ThreadLocalRandom src-of-randomness---+ |
|
PAR3/2wTLR PI ~ 3.141519396 calculated in 6944[ms] | log10 DIFF ~ -4.135147373916102
|| | |
Amdahl-Law-fair-2-blks/2-threads--+| | |
while(){...}-----------------------+ | |
ThreadLocalRandom src-of-randomness---+ |
|
PAR3/4wTLR PI ~ 3.141657656 calculated in 7489[ms] | log10 DIFF ~ -4.187070539970178
|| | vs 14xxx
Amdahl-Law-fair-4-blks/2-threads--+| | ~~~~~~~~~~~~~ /4+ threads???
while(){...}-----------------------+ |
ThreadLocalRandom src-of-randomness---+
1E+9
SEQ wTLR PI ~ 3.141670832 calculated in 7142[ms] | log10 DIFF ~ -4.106913165387672
SEQ wTLR PI ~ 3.141631852 calculated in 6913[ms] | log10 DIFF ~ -4.40673154655876
PAR3/4wTLR PI ~ 3.141499044 calculated in 7127[ms] | log10 DIFF ~ -4.028679657876861
PAR3/4wTLR PI ~ 3.141521608 calculated in 7420[ms] | log10 DIFF ~ -4.148462876047807
PAR3/2wTLR PI ~ 3.141519396 calculated in 6944[ms] | log10 DIFF ~ -4.135147373916102
PAR3/2wTLR PI ~ 3.141602576 calculated in 6954[ms] | log10 DIFF ~ -5.0033828225658254
Real PI = 3.141592653589793
SEQ wTLR PI ~ 3.141622264 calculated in 6967[ms] | log10 DIFF ~ -4.52855557608415
SEQ wTLR PI ~ 3.141507096 calculated in 6894[ms] | log10 DIFF ~ -4.067741458256534
PAR3/4wTLR PI ~ 3.141655584 calculated in 7106[ms] | log10 DIFF ~ -4.2011394373291715
PAR3/4wTLR PI ~ 3.141564284 calculated in 7386[ms] | log10 DIFF ~ -4.547146943793993
PAR3/2wTLR PI ~ 3.141547824 calculated in 6919[ms] | log10 DIFF ~ -4.348435235065685
PAR3/2wTLR PI ~ 3.141512504 calculated in 6959[ms] | log10 DIFF ~ -4.096098696030472
Real PI = 3.141592653589793
-----< initial tests >-----
stream SEQ: PI ~ 3.14148516 calculated in 2946 msecs
naive SEQ2 PI ~ 3.14149424 calculated in 1830 msecs
stream PAR: PI ~ 3.14171496 calculated in 2853 msecs
stream PAR2 PI ~ 3.1414986 calculated in 3017 msecs
naive PAR3 PI ~ 3.14157392 calculated in 1842 msecs
Real PI = 3.141592653589793
stream SEQ: PI ~ 3.1416566 calculated in 2977 msecs
naive SEQ2 PI ~ 3.14167044 calculated in 1868 msecs
stream PAR: PI ~ 3.1415044 calculated in 2617 msecs
stream PAR2 PI ~ 3.14159492 calculated in 3022 msecs
naive PAR3 PI ~ 3.14154072 calculated in 1811 msecs
Real PI = 3.141592653589793
-----< extended tests >-----
SEQ: MrndPI ~ 3.14184796 calculated in 3003[ms] | log10 DIFF ~ -3.5929382808376515
SEQ: MrndPI ~ 3.14152884 calculated in 2769[ms] | log10 DIFF ~ -4.195086823728298
SEQ2 wMrndPI ~ 3.14192708 calculated in 1895[ms] | log10 DIFF ~ -3.475699432925146
SEQ2 wMrndPI ~ 3.14171308 calculated in 1846[ms] | log10 DIFF ~ -3.9192782593456794
SEQ wTLR PI ~ 3.14165348 calculated in 739[ms] | log10 DIFF ~ -4.215907813544407
SEQ wTLR PI ~ 3.14136852 calculated in 749[ms] | log10 DIFF ~ -3.649493053020102
PAR: sMrndPI ~ 3.14179224 calculated in 2857[ms] | log10 DIFF ~ -3.699869033054068
PAR: sMrndPI ~ 3.141472 calculated in 2652[ms] | log10 DIFF ~ -3.9184597520458992
PAR2 sMrndPI ~ 3.14140116 calculated in 2940[ms] | log10 DIFF ~ -3.717845759366814
PAR2 sMrndPI ~ 3.14157204 calculated in 2930[ms] | log10 DIFF ~ -4.685846370584897
PAR3/1sTLR PI ~ 3.14143636 calculated in 2044[ms] | log10 DIFF ~ -3.8060588337186982
PAR3/1sTLR PI ~ 3.14131304 calculated in 2042[ms] | log10 DIFF ~ -3.5534417248121435
PAR3/4wMrndPI ~ 3.14142288 calculated in 1832[ms] | log10 DIFF ~ -3.770129868268526
PAR3/4wMrndPI ~ 3.1417362 calculated in 1785[ms] | log10 DIFF ~ -3.843007663821216
PAR3/4wTLR PI ~ 3.14152796 calculated in 746[ms] | log10 DIFF ~ -4.189138749553116
PAR3/4wTLR PI ~ 3.14163412 calculated in 748[ms] | log10 DIFF ~ -4.382303560363832
PAR3/2wTLR PI ~ 3.14163616 calculated in 757[ms] | log10 DIFF ~ -4.361446749657272
PAR3/2wTLR PI ~ 3.14137776 calculated in 732[ms] | log10 DIFF ~ -3.6677765391807546
Real PI = 3.141592653589793
1E+8
SEQ wTLR PI ~ 3.14173396 calculated in 710[ms] | log10 DIFF ~ -3.8498381364228056
SEQ wTLR PI ~ 3.14174612 calculated in 714[ms] | log10 DIFF ~ -3.813986665516408
PAR3/4wMrndPI ~ 3.14170104 calculated in 2022[ms] | log10 DIFF ~ -3.9650251674482195
PAR3/4wMrndPI ~ 3.1418928 calculated in 1725[ms] | log10 DIFF ~ -3.522666846499994
PAR3/4wTLR PI ~ 3.14170568 calculated in 734[ms] | log10 DIFF ~ -3.9468200656590717
PAR3/4wTLR PI ~ 3.14148048 calculated in 749[ms] | log10 DIFF ~ -3.9501093815571333
PAR3/2wTLR PI ~ 3.14175252 calculated in 709[ms] | log10 DIFF ~ -3.7962427769936427
PAR3/2wTLR PI ~ 3.14175248 calculated in 741[ms] | log10 DIFF ~ -3.796351454937919
Real PI = 3.141592653589793
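For reference, a minimal sketch of the split-task, naive while(){...} approach labelled PAR3/2wTLR above could look like the following ( an illustrative reconstruction from the description, not the original benchmark code; the method name is made up ):

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// hypothetical reconstruction: split NUM_SAMPLES into a few blocks, run a tight
// while(){...} loop per block in parallel, and sum the per-block hit counts
static double estimatePiInBlocks( long numSamples, int blocks ) {
    final long perBlock = numSamples / blocks;
    final long hits = IntStream.range( 0, blocks )
                               .parallel()
                               .mapToLong( b -> {
                                   ThreadLocalRandom rnd = ThreadLocalRandom.current();
                                   long count = 0;
                                   long i = perBlock;
                                   while ( i-- > 0 ) {
                                       double x = rnd.nextDouble();
                                       double y = rnd.nextDouble();
                                       if ( x * x + y * y < 1 ) count++;
                                   }
                                   return count;
                               } )
                               .sum();
    return 4.0 * hits / ( perBlock * blocks );
}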
What must one always know before making a [SEQ] | [PAR] decision?
Such a decision is neither a toss of a coin, nor an easy & straightforward step. We hear many times about the immense powers of parallel computing, but Mother Nature does not let us gain without pain.
Gene AMDAHL formulated a law which demonstrates the speedups possible for a process if more processors get harnessed.
The story is not trivial, even in the simplified Amdahl's Law formulation ( details will come later ).
Given a process-P, we may assume that
for every P there is always a part scheduled and executed in a [SERIAL]-way ( let's call it the [SEQ]-part ), and similarly
for every P there may be one or more parts of P that may get executed in a [PARALLEL]-fashion ( as resources and mutual-(mis)-coordination permit ).
Amdahl compared such pure-[SERIAL] process-execution with a possible [SEQ]+[PAR] process-execution, using the former part as-is and gaining the possible speedup only from the latter part by splitting it among N processors; the speedup S is:
              1
S = _____________________ ; where
         ( 1 - s )               s      := duration the [SEQ]-part took
     s + _________             ( 1-s )  := duration the [PAR]-part took before the split
             N                   N      := number of [PAR]-processors to split the task onto
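As a quick illustrative check ( this helper is mine, not part of the original answer ), the simplified formula can be coded and evaluated directly:

// simplified ( overhead-naive ) Amdahl's Law speedup -- illustrative helper only
static double amdahlSpeedup( double s, int N ) {
    return 1.0 / ( s + ( 1.0 - s ) / N );
}
// e.g. amdahlSpeedup( 0.25, 4 ) ~ 2.29, and no number of processors
//      can ever push the speedup above 1 / s = 4.0 for s = 0.25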
Even this simplified formulation immediately warns that:
- the more time the [SEQ]-part takes, the less speedup the [PAR]-part may deliver by being split onto any number of processors N > 1
So the first output of one's decision-making is: isolate and benchmark the [SEQ]-part and the [PAR]-part, so as to get a first approximation of what levels of speedup are principally attainable ( even with an infinite number of processors ).
Next comes the hard part of the truth:
There is no such thing as a free lunch. Mother Nature steps in; there is no gain without pain.
One can always spend a reasonable amount of science on re-factoring the [SEQ]-part, which yields immediate effects on the overall process-duration. The theory of [?TIME][?SPACE]-complexity ( ? == { c-onst | p-olynomial | exp-onential } ) frames the potential costs and savings achievable in these efforts. Benchmarks, meaning REALISTIC benchmarks, will tell the difference. What works nice and smooth at in-cache sizes will choke and progress much more slowly at beyond-cache-size scales, etc. So always design REALISTIC benchmarks at scales that remain relevant for the target process-scaling; otherwise poor decisions come out of such a scaling-naive experimenter's table.
Next, one can harness some parallel-computing tools so as to get the [PAR]-part executed faster on some available sub-set of processors.
Given that the [SEQ]-part has been squashed close to the silicon limits, and that the [PAR]-part duration remains a reasonable portion of the whole original process-duration, the [PAR]-tools may indeed help --- BUT AT A COST --- yes, one always has to pay some cost of "arranging" all the toys so as to start executing in [PAR]. Similarly, at the end of the [PAR]-scheduled code-execution there is another step: somehow collect the results, deactivate / terminate the individual [PAR]-executions, and deliver these results back into the [SEQ]-part of the code, where the ( hopefully ) "accelerated" results get further used.
All this is called overhead. The overhead, let's call it o, remains hidden from one's eyes in the simplified ( overhead-naive ) Amdahl's Law formulation, but real-world execution will always demonstrate this add-on part in benchmarking, as the code cannot "skip" it by some magic instruction for a direct "jump"-from-[SEQ]-into-[PAR] and "jump"-from-[PAR]-back-to-[SEQ]; such things simply do not exist...
Overhead-strict formulation of the Amdahl's Law speedup S:

                   1
S = ________________________________ ; where s, ( 1 - s ), N were defined above
                 ( 1 - s )                    pSO := [PAR]-Setup-Overhead     add-on
     s + pSO + _____________ + pTO            pTO := [PAR]-Terminate-Overhead add-on
                     N
Reading both formulations, one need not be a mathematical genius to realise that it is quite enough to have to pay a sufficiently expensive ( pSO + pTO ) for the [PARALLEL]-version to become slower, not faster, than the pure [SERIAL]-version of a process that was naively expected to get theoretically accelerated, even if an infinite number of processors were harnessed.
Q.E.D.
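Again purely for illustration ( my helper, not part of the original answer ), plugging numbers into the overhead-strict formula shows how the add-on costs alone can push the speedup below 1.0:

// overhead-strict Amdahl's Law speedup -- illustrative helper only
static double amdahlSpeedupWithOverheads( double s, double pSO, double pTO, int N ) {
    return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO );
}
// e.g. with s = 0.05 and ( pSO + pTO ) = 0.60 on N = 2 processors:
//      amdahlSpeedupWithOverheads( 0.05, 0.30, 0.30, 2 ) ~ 0.89  ->  slower than pure [SERIAL]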
In the Epilogue section of this, there is an easy-to-experiment, interactive + animated toy-tool with sliders for both o and p == ( 1 - s ), to evaluate the effects of the overhead-strict Amdahl's Law for one's individual [SEQ]-[PAR] process compositions. Feel free to click through and experiment on your own.
This will help you see when the principal [PAR]-overheads start to get at least compensated by N-CPU-[PARALLEL]-execution, and when any such efforts remain wasted in principally worse performance than a [SERIAL]-process execution, which is always, in this sense, an overhead-free and cheap competitor.