Parallel execution for IO bound operations

2019-02-02 14:37发布

问题:

I have read TPL and Task library documents cover to cover. But, I still couldn't comprehend the following case very clearly and right now I need to implement it.

I will simplify my situation. I have an IEnumerable<Uri> of length 1000. I have to make a request for them using HttpClient.

I have two questions.

  1. There is not much computation, just waiting for Http request. In this case can I still use Parallel.Foreach() ?
  2. In case of using Task instead, what is the best practice for creating huge number of them? Let's say I use Task.Factory.StartNew() and add those tasks to a list and wait for all of them. Is there a feature (such as TPL partitioner) that controls number of maximum tasks and maximum HttpClient I can create?

There are couple of similar questions on SO, but no one mentions the maximums. The requirement is just using maximum tasks with maximum HttpClient.

Thank you in advance.

回答1:

In this case can I still use Parallel.Foreach ?

This isn't really appropriate. Parallel.Foreach is more for CPU intensive work. It also doesn't support async operations.

In case of using Task instead, what is the best practice for creating huge number of them?

Use a TPL Dataflow block instead. You don't create huge amounts of tasks that sit there waiting for a thread to become available. You can configure the max amount of tasks and reuse them for all the items that meanwhile sit in a buffer waiting for a task. For example:

var block = new ActionBlock<Uri>(
    uri => SendRequestAsync(uri),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

foreach (var uri in uris)
{
    block.Post(uri);
}

block.Complete();
await block.Completion;


回答2:

i3arnon's answer with TPL Dataflow is good; Dataflow is useful especially if you have a mix of CPU and I/O bound code. I'll echo his sentiment that Parallel is designed for CPU-bound code; it's not the best solution for I/O-based code, and especially not appropriate for asynchronous code.

If you want an alternative solution that works well with mostly-I/O code - and doesn't require an external library - the method you're looking for is Task.WhenAll:

var tasks = uris.Select(uri => SendRequestAsync(uri)).ToArray();
await Task.WhenAll(tasks);

This is the easiest solution, but it does have the drawback of starting all requests simultaneously. Particularly if all requests are going to the same service (or a small set of services), this can cause timeouts. To solve this, you need to use some kind of throttling...

Is there a feature (such as TPL partitioner) that controls number of maximum tasks and maximum HttpClient I can create?

TPL Dataflow has that nice MaxDegreeOfParallelism which only starts so many at a time. You can also throttle regular asynchronous code by using another builtin, SemaphoreSlim:

private readonly SemaphoreSlim _sem = new SemaphoreSlim(50);
private async Task SendRequestAsync(Uri uri)
{
  await _sem.WaitAsync();
  try
  {
    ...
  }
  finally
  {
    _sem.Release();
  }
}

In case of using Task instead, what is the best practice for creating huge number of them? Let's say I use Task.Factory.StartNew() and add those tasks to a list and wait for all of them.

You actually don't want to use StartNew. It only has one appropriate use case (dynamic task-based parallelism), which is extremely rare. Modern code should use Task.Run if you need to push work onto a background thread. But you don't even need that to begin with, so neither StartNew nor Task.Run is appropriate here.

There are couple of similar questions on SO, but no one mentions the maximums. The requirement is just using maximum tasks with maximum HttpClient.

Maximums are where asynchronous code really gets tricky. With CPU-bound (parallel) code, the solution is obvious: you use as many threads as you have cores. (Well, at least you can start there and adjust as necessary). With asynchronous code, there isn't as obvious of a solution. It depends on a lot of factors - how much memory you have, how the remote server responds (rate limiting, timeouts, etc), etc.

There's no easy solutions here. You just have to test out how your specific application deals with high levels of concurrency, and then throttle to some lower number.


I have some slides for a talk that attempts to explain when different technologies are appropriate (parallelism, asynchrony, TPL Dataflow, and Rx). If you prefer more of a written description with recipes, I think you may benefit from my book on concurrency.