How much does parallelization help the performance

Posted 2019-08-15 08:29

Question:

I parallelized a Java program. On a Mac with 4 cores, below is the time for different number of threads.

threads #            1           2           4           8          16
time        2597192200  1915988600  2086557400  2043377000  1931178200

On a Linux server with two sockets, each with 4 cores, below is the measured time.

threads #            1           2           4           8          16
time        4204436859  2760602109  1850708620  2370905549  2422668438

As you can see, the speedup is far from linear. There is almost no parallelization overhead in this program, such as synchronization or I/O dependencies.

I have two questions:

  1. Do these data imply that this Java program is memory-bound?
  2. If so, is there any way to further improve the performance without changing the hardware?

Answer 1:

Answering the Title Question

Amdahl's Law explains that the speed-up obtained by parallelizing a program depends on how much of the program is parallelizable.

And we must also add in the overhead for coordinating the parallelism.

So we consider what fraction of the program is parallelizable, and what overhead (synchronization, communication, false sharing, etc.) is incurred.
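For example, here is a quick sketch (not from the original question; the parallel fractions p are just illustrative) of what Amdahl's Law predicts for 1 to 16 threads:

    public class AmdahlDemo {
        // Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n),
        // where p is the parallelizable fraction and n is the number of threads.
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            int[] threads = {1, 2, 4, 8, 16};
            for (double p : new double[] {0.5, 0.75, 0.9, 0.95}) {
                StringBuilder row = new StringBuilder("p = " + p + ":");
                for (int n : threads) {
                    row.append(String.format("  %.2fx", speedup(p, n)));
                }
                System.out.println(row);
            }
        }
    }

Even with 90% of the work parallelizable, 16 threads give at most about a 6.4x speedup, and any coordination overhead only pushes that down further.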

Is Reading Memory Parallelizable?

From hard drive

You can read from 2 different hard disk drives at the same time without a slow down.

But, usually parallelism does not provide a speed-up to reading from a hard drive.

Hard disk drives (i.e. drives with a spinning disk) are optimized for sequential reads, and jumping around between locations on the disk will slow down the overall transfer.

Solid state drives are actually quite good at randomly accessing data, jumping here and there in memory, so with solid state drives keeping the read/write queue full is a good idea.

From RAM and Cache

Understanding the idea of a cache-line will help avoid false-sharing.

This type of memory operation parallelizes effectively, for example iterating over an array by dividing it into four partitions, one per thread.
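As a rough sketch of what I mean (the array size and the four-way split are placeholders, not your actual code), each thread can sum its own partition into a local variable, so the threads never write to shared memory and false sharing is avoided:

    import java.util.concurrent.*;

    public class PartitionedSum {
        public static void main(String[] args) throws Exception {
            double[] data = new double[4_000_000];
            java.util.Arrays.fill(data, 1.0);

            int partitions = 4;                       // one partition per core
            int chunk = data.length / partitions;
            ExecutorService pool = Executors.newFixedThreadPool(partitions);
            java.util.List<Future<Double>> partials = new java.util.ArrayList<>();

            for (int i = 0; i < partitions; i++) {
                final int from = i * chunk;
                final int to = (i == partitions - 1) ? data.length : from + chunk;
                partials.add(pool.submit(() -> {
                    double local = 0;                 // thread-local accumulator: no shared writes,
                    for (int j = from; j < to; j++) { // so no false sharing between threads
                        local += data[j];
                    }
                    return local;
                }));
            }

            double total = 0;
            for (Future<Double> f : partials) total += f.get();
            pool.shutdown();
            System.out.println("sum = " + total);
        }
    }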

Your Question

I'm assuming that your times are in nanoseconds, so on computer 1, the program took about 2.6 seconds and then leveled off to about 2 seconds, with a best time of about 1.9 seconds.

I am hoping that you had minimal background programs running at the same time, and that you performed these tests a few times to get rid of irregularities.

Also, timing irregularities can come from the Java virtual machine's Just-In-Time (JIT) compilation, so to time accurately you want to run the code in a loop a few times and record the time of the last iteration (or pre-compile to native code).

Also, the first time the program runs, much of the data read from the hard drive is pulled into the cache, so later executions should be faster. (So either time the last run after looping, to ensure the data is in cache, or use the first timing but power the computer off and on between measurements.)
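Something like the following harness (the `work()` method is just a stand-in for your actual parallel code) warms up the JIT and the caches before taking the measurement that counts:

    public class TimingHarness {
        // Stand-in for the actual parallel workload being measured.
        static void work() {
            double x = 0;
            for (int i = 0; i < 10_000_000; i++) x += Math.sqrt(i);
            if (x < 0) System.out.println(x);   // keep the result alive so the JIT can't drop the loop
        }

        public static void main(String[] args) {
            // Warm-up iterations let the JIT compile the hot paths and fill the caches.
            for (int i = 0; i < 5; i++) work();

            // Only the last, warmed-up run is reported.
            long start = System.nanoTime();
            work();
            long elapsed = System.nanoTime() - start;
            System.out.printf("last run: %.3f s%n", elapsed / 1e9);
        }
    }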

Is the program Memory Bound?

Based only on your timings, this is hard to say.

The first computer took about 2.6 seconds, ran roughly 26% faster with 2 threads (about 1.9 seconds), but then stayed at about 2.0 seconds.

By itself, this speedup could just be the result of the JIT and the caches being warmed by the 1-thread run. After that, any differences in run time might just be noise.

The second computer took 4.2 seconds, then 2.8, then 1.9, then back up to about 2.4 seconds.

This one does seem to demonstrate some type of speed-up from parallelism, but some kind of contention occurs (memory, cache-lines, synchronization, etc.), as demonstrated by the increase in time from 4 threads to 8 threads.

Any way to improve performance?

Use a profiler on your code, determine what parts of your code are taking up the most time.

(You can simulate a profiler by pausing your code in a debugger and seeing where the program is stopped. Repeat that about 10 times to see whether one part is hit proportionally more often than the others.)
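If you want to stay in plain Java rather than attach a real profiler, a crude sampling version of the same idea (the class and method names here are just for illustration) is to dump all thread stack traces once a second and see which methods keep showing up:

    import java.util.Map;

    public class Sampler {
        public static void main(String[] args) throws InterruptedException {
            // Start the workload in its own thread (stand-in for the real program).
            Thread worker = new Thread(Sampler::busyWork, "worker");
            worker.start();

            // Take 10 samples, one per second, and print where each thread is.
            for (int sample = 1; sample <= 10; sample++) {
                Thread.sleep(1000);
                System.out.println("--- sample " + sample + " ---");
                for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                    StackTraceElement[] frames = e.getValue();
                    if (frames.length > 0) {
                        System.out.println(e.getKey().getName() + " @ " + frames[0]);
                    }
                }
            }
            worker.interrupt();
        }

        static void busyWork() {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) x += Math.log(x + 2);
        }
    }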

Use better algorithms, or arrange the data in memory (data structures) in a way that suits the problem better.

Exploit more parallelism in the problem.

Try to make hard drive reads sequential. Maybe have just one thread that reads from the hard drive and puts the data into a concurrent queue to be operated on by the other threads.
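A sketch of that single-reader pattern (the file name, queue size, and `process` method are placeholders) using a BlockingQueue:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SingleReaderPipeline {
        // Lines from readLine() never contain '\n', so this sentinel cannot collide with real data.
        private static final String POISON = "\n<<end-of-input>>";

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
            int workers = 4;

            // Worker threads consume records from the queue and do the CPU-bound processing.
            Thread[] pool = new Thread[workers];
            for (int i = 0; i < workers; i++) {
                pool[i] = new Thread(() -> {
                    try {
                        String line;
                        while (!(line = queue.take()).equals(POISON)) {
                            process(line);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                pool[i].start();
            }

            // A single reader keeps the disk access sequential.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);
                }
            }
            for (int i = 0; i < workers; i++) queue.put(POISON);  // one sentinel per worker
            for (Thread t : pool) t.join();
        }

        private static void process(String line) {
            // placeholder for the real per-record work
        }
    }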



Answer 2:

Well, they imply that the algorithm is not CPU bound. It is likely bound by something else: memory, I/O, or some other resource, but probably not the CPU.