Disappointing performance with Parallel.For

Posted 2020-05-26 09:08

I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?

Here is my test code, sequential:

var loops = 200;
var perloop = 10000000;

var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
    var sumk = 0.0;
    for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
    sum += sumk;
}

and parallel:

sum = 0.0;
Parallel.For(0, loops,
                k =>
                    {
                        var sumk = 0.0;
                        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
                        sum += sumk;
                    });

The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.

In Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.

In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set

ParallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount

but to no avail. It is not clear to me why all of the CPU power is not devoted to the parallel calculation.
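
For reference, the option has to go on a ParallelOptions instance that is passed to Parallel.For; a minimal sketch of that call (same loop body as above):

var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.For(0, loops, options,
                k =>
                    {
                        var sumk = 0.0;
                        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
                        sum += sumk;   // same shared-variable caveat discussed in the answers below
                    });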

[Screenshot: sequential vs. parallel CPU utilization]

I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.

UPDATE

It was pointed out in the comments that the CPU I am using has only 4 physical cores, which are visible to the system as 8 logical cores when hyper-threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.

With hyper-threading disabled, my calculations are now faster, both the parallel loop and the (what I thought was) sequential for loop. CPU utilization during the for loop reaches approx. 45% (!!!) and 100% during the Parallel.For loop.

Computation time for the for loop is 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (25% better than with hyper-threading enabled). The performance ratio with Parallel.For is now only 2.5, running on 4 real cores.

So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand, it is intriguing that CPU utilization is so high during the for loop. Could there be some kind of internal parallelization going on in this loop as well?

4 Answers
三岁会撩人 · #2 · 2020-05-26 09:23

An example comparing a plain foreach loop with Parallel.ForEach:

    for (int i = 0; i < 10; i++)
    {
        int[] array = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 };
        Stopwatch watch = new Stopwatch();
        watch.Start();
        // Parallel foreach: 30 items, each doing the same amount of busy work
        // (note: an empty inner loop may be optimized away in a Release build)
        Parallel.ForEach(array, item =>
        {
            for (int x = 0; x < 10000000; x++)
            {
            }
        });
        watch.Stop();
        Console.WriteLine("Parallel.ForEach {0} ms", watch.ElapsedMilliseconds);

        // Sequential foreach over the same 30 items with the same busy work
        watch = Stopwatch.StartNew();
        foreach (int item in array)
        {
            for (int z = 0; z < 10000000; z++)
            {
            }
        }
        watch.Stop();
        Console.WriteLine("ForEach {0} ms", watch.ElapsedMilliseconds);

        Console.WriteLine("####");
    }
    Console.ReadKey();


My CPU: Intel® Core™ i7-620M Processor (4M Cache, 2.66 GHz)

不美不萌又怎样 · #3 · 2020-05-26 09:25

I think the computation gain is so low because your loop body is "too cheap" relative to the overhead of dispatching work each iteration: Parallel.For has to hand each iteration's body over to the worker tasks, and servicing those across threads takes time. I would write it like this:

// requires: using System.Collections.Concurrent; using System.Linq;
//           using System.Threading; using System.Threading.Tasks;
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

Parallel.ForEach(
    Partitioner.Create(0, nums.Length),   // hand out index ranges, not single items
    () => 0L,                             // thread-local partial sum (long, to avoid int overflow)
    (part, loopState, partSum) =>
    {
        for (int i = part.Item1; i < part.Item2; i++)
        {
            partSum += nums[i];
        }
        return partSum;
    },
    (partSum) =>
    {
        // one atomic add per partition, not per element
        Interlocked.Add(ref total, partSum);
    }
);

The Partitioner creates a chunk of work (a range of indices) for each task, so far less time is spent handing individual iterations over to the threads. If you can, please benchmark this solution and tell us whether it gives a better speed-up.
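
Applied to the loop from the question, the same pattern might look roughly like this (a sketch, assuming loops and perloop are defined as in the question; since the accumulator is a double and Interlocked.Add has no double overload, a lock guards the single per-partition merge):

// requires: using System.Collections.Concurrent; using System.Threading.Tasks;
var sum = 0.0;
var sumLock = new object();
Parallel.ForEach(
    Partitioner.Create(0, loops),      // partition the outer iterations into ranges
    () => 0.0,                         // thread-local partial sum
    (range, loopState, partial) =>
    {
        for (var k = range.Item1; k < range.Item2; k++)
        {
            var sumk = 0.0;
            for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
            partial += sumk;
        }
        return partial;
    },
    partial => { lock (sumLock) sum += partial; }   // one lock per partition, not per iteration
);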

叛逆 · #4 · 2020-05-26 09:36

Parallel.For and Parallel.ForEach use a degree of parallelism that they consider appropriate, balancing the cost of setting up and tearing down threads against the work they expect each thread to perform. .NET 4.5 made several performance improvements (including more intelligent decisions about the number of threads to spin up) compared to previous .NET versions.

Note that, even if it were to spin up one thread per core, context switches, false sharing, resource locks, and other factors may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).
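
To make the false-sharing point concrete, here is a toy sketch (not part of the original answer): two tasks increment counters that sit on the same cache line, then counters padded far enough apart to land on different cache lines. The padded version typically runs noticeably faster, even though both do the same work.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    const long Iterations = 100000000;

    static long TimeIncrements(long[] slots, int a, int b)
    {
        var sw = Stopwatch.StartNew();
        Task.WaitAll(
            Task.Run(() => { for (long n = 0; n < Iterations; n++) slots[a]++; }),
            Task.Run(() => { for (long n = 0; n < Iterations; n++) slots[b]++; }));
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        var slots = new long[32];
        // Adjacent longs share a 64-byte cache line, so the cores keep invalidating each other.
        Console.WriteLine("same cache line:       {0} ms", TimeIncrements(slots, 0, 1));
        // Elements 16 longs (128 bytes) apart sit on different cache lines.
        Console.WriteLine("different cache lines: {0} ms", TimeIncrements(slots, 0, 16));
    }
}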

家丑人穷心不美 · #5 · 2020-05-26 09:41

Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable, each core has to access the same location in system memory, or wait for another core to finish before touching it. You can avoid corruption without locks by using the lighter Interlocked.Add method, which adds a value to the sum atomically using hardware-level atomic instructions, but you will still get delays due to contention.
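
One caveat for the question's code specifically (my addition, not part of the original answer): the accumulator there is a double, and Interlocked.Add has no double overload. The usual workaround is a small Interlocked.CompareExchange retry loop, sketched below; the helper name AddToTotal is purely illustrative.

// requires: using System.Threading;
static void AddToTotal(ref double total, double value)
{
    double initial, computed;
    do
    {
        initial = total;              // snapshot the current total
        computed = initial + value;   // the new value we want to publish
    }
    while (Interlocked.CompareExchange(ref total, computed, initial) != initial);
    // CompareExchange stores 'computed' only if 'total' still equals 'initial';
    // otherwise another thread won the race and we retry with the fresh value.
}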

The proper way to do this is to update a thread-local variable with the partial sums and add all of them to a single global total at the end. Parallel.For has an overload that does just this. MSDN even has a summation example at How To: Write a Parallel.For Loop that has Thread-Local Variables.

int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

// Use type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length,
    () => 0,                               // initialize the thread-local subtotal
    (j, loop, subtotal) =>
    {
        subtotal += nums[j];
        return subtotal;
    },
    (x) => Interlocked.Add(ref total, x)   // merge each thread's subtotal into the global total
);

Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.
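
The same overload can be adapted to the question's loop with a double subtotal. This is only a sketch (it assumes loops and perloop as defined in the question), and because Interlocked.Add does not accept doubles, the merge step below uses a lock instead (the CompareExchange helper sketched above would also work):

var sum = 0.0;
var sumLock = new object();

Parallel.For<double>(0, loops,
    () => 0.0,                             // thread-local partial sum
    (k, loop, subtotal) =>
    {
        var sumk = 0.0;
        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
        return subtotal + sumk;
    },
    x => { lock (sumLock) sum += x; }      // merged once per worker, not per iteration
);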
