I am trying to speed up my calculation times by using Parallel.For
. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for
loop. Is this as good as it can get with Parallel.For
, or can the method call be fine-tuned to increase performance?
Here is my test code, sequential:
var loops = 200;
var perloop = 10000000;
var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
}
and parallel:
sum = 0.0;
Parallel.For(0, loops,
k =>
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
});
The loop that I am parallelizing involves computation with a "globally" defined variable, sum
, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.
In Release build ("optimize code" flag set) the sequential for
loop takes 33.7 s on my computer, whereas the Parallel.For
loop takes 8.4 s, a performance ratio of only 4.0.
In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set
ParallelOptions.MaxDegreesOfParallelism = Environment.ProcessorCount
but to no avail. It is not clear to me why not all CPU power is assigned to the parallel calculation?
I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.
UPDATE
It was pointed out to me in some of the comments that the CPU I am using only has 4 physical cores, which is visible to the system as 8 cores if hyper threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.
With hyper-threading disabled, my calculations are now faster, both the parallel and also the (what I thought was) sequential for
loop. CPU utilization during the for
loop is up to approx. 45% (!!!) and 100% during the Parallel.For
loop.
Computation time for the for
loop 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For
(25% better than when hyper-threading is enabled). Performance ratio with Parallel.For
is now only 2.5, running on 4 real cores.
So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand it is intriguing that CPU utilization is so high during the for
loop? Could there be some kind of internal parallelization going on in this loop as well?
foreach vs parallel for each an example
My CPU
Intel® Core™ i7-620M Processor (4M Cache, 2.66 GHz)
I think the computation gain is so low because your code is "too easy" to work on other task each iteration - because parallel.for just create new task in each iteration, so this will take time to service them in threads. I will it like this:
Partitioner will create optimal part of job for each task, there will be less time for service task with threads. If you can, please benchmark this solution and tell us if it get better speed up.
Parallel.For and Parallel.ForEach will use a degree of parallelism that it feels is appropriate, balancing the cost to setup and tear down threads and the work it expects each thread will perform. .NET 4.5 made several improvements to performance (including more intelligent decisions on the number of threads to spin up) compared to previous .NET versions.
Note that, even if it were to spin up one thread per core, context switches, false sharing issues, resource locks, and other issues may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).
Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable each core will have to get access to the same place in system memory, or wait for the other core to finish before accessing it. You can avoid corruption without locks by using the lighter Interlocked.Add method to add a value to the sum atomically, at the OS level, but you will still get delays due to contention.
The proper way to do this is to update a thread local variable to create the partial sums and add all of them to a single global sum at the end. Parallel.For has an overload that does just this. MSDN even has an example using sumation at How To: Write a Parallel.For Loop that has Thread Local Variables
Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.