I have an application that converts data; there are often 1,000 to 30,000 files.
I need to do 3 steps:
- Copy a file (and replace some text in it)
- Make a web request with WebClient to download a file (I send the copied file to a web server, which converts the file to another format)
- Take the downloaded file and change some of its content
All three steps involve I/O, so I used async/await methods:
    var tasks = files.Select(async (file) =>
    {
        Item item = await createtempFile(file).ConfigureAwait(false);
        await convert(item).ConfigureAwait(false);
        await clean(item).ConfigureAwait(false);
    }).ToList();
    await Task.WhenAll(tasks).ConfigureAwait(false);
I don't know if this is best practice, because I create more than a thousand tasks. I thought about splitting the three steps like this:
    List<Item> items = new List<Item>();

    // Step 1: create all temp files
    var createTasks = files.Select(async (file) =>
    {
        Item item = await createtempFile(file, ext).ConfigureAwait(false);
        lock (items)
        {
            items.Add(item);
        }
    }).ToList();
    await Task.WhenAll(createTasks).ConfigureAwait(false);

    // Step 2: send every item to the web server for conversion
    var convertTasks = items.Select(async (item) =>
    {
        await convert(item, baseAddress, ext).ConfigureAwait(false);
    }).ToList();
    await Task.WhenAll(convertTasks).ConfigureAwait(false);

    // Step 3: clean up the downloaded files
    var cleanTasks = items.Select(async (item) =>
    {
        await clean(targetFile, item.Doctype, ext).ConfigureAwait(false);
    }).ToList();
    await Task.WhenAll(cleanTasks).ConfigureAwait(false);
But that doesn't seem to be better or faster, because I create thousands of tasks three times.
Should I throttle the creation of tasks, for example in chunks of 100? Or am I just overthinking it, and creating thousands of tasks is fine?
The CPU is idling at a 2-4% peak, so I suspected too many awaits or context switches.
Maybe there are too many WebRequest calls, because the web server/web service can't handle thousands of requests simultaneously, and I should throttle only the web requests?
I already increased the .NET maxconnection setting in the app.config file.
As commenters have correctly noted, you're overthinking it. The .NET runtime has absolutely no problem tracking thousands of tasks.
However, you might want to consider using a TPL Dataflow pipeline, which would enable you to easily have different concurrency levels for different operations ("blocks") in your pipeline.
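As an illustration of that idea, here is a minimal sketch of a three-stage Dataflow pipeline. It assumes the System.Threading.Tasks.Dataflow package is referenced (a NuGet package on .NET Framework). The methods CreateTempFileAsync, ConvertAsync and CleanAsync are hypothetical stand-ins for the question's createtempFile, convert and clean, simulated here with Task.Delay, and the concurrency limits (16, 4, 16) are arbitrary values chosen only for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public sealed class Item
{
    public string Path { get; set; }
}

public static class ConversionPipeline
{
    // Hypothetical stand-in for createtempFile: simulated disk I/O.
    private static async Task<Item> CreateTempFileAsync(string file)
    {
        await Task.Delay(10).ConfigureAwait(false);
        return new Item { Path = file + ".tmp" };
    }

    // Hypothetical stand-in for convert: simulated web request.
    private static async Task<Item> ConvertAsync(Item item)
    {
        await Task.Delay(10).ConfigureAwait(false);
        return item;
    }

    // Hypothetical stand-in for clean: simulated disk I/O.
    private static Task CleanAsync(Item item)
    {
        return Task.Delay(10);
    }

    // Runs all files through the pipeline and returns how many were cleaned.
    public static async Task<int> RunAsync(IEnumerable<string> files)
    {
        var cleaned = new ConcurrentBag<string>();

        // Each block gets its own concurrency limit (values are assumptions).
        var createBlock = new TransformBlock<string, Item>(
            file => CreateTempFileAsync(file),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 16 });

        // Keep the web stage low so the server is not flooded with requests.
        var convertBlock = new TransformBlock<Item, Item>(
            item => ConvertAsync(item),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        var cleanBlock = new ActionBlock<Item>(
            async item =>
            {
                await CleanAsync(item).ConfigureAwait(false);
                cleaned.Add(item.Path);
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 16 });

        // Propagate completion so finishing the first block drains the rest.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        createBlock.LinkTo(convertBlock, link);
        convertBlock.LinkTo(cleanBlock, link);

        foreach (var file in files)
        {
            createBlock.Post(file);
        }
        createBlock.Complete();

        await cleanBlock.Completion.ConfigureAwait(false);
        return cleaned.Count;
    }
}
```

With this shape, only the web request stage is throttled tightly while the file stages run with more parallelism, and an item moves on to the next stage as soon as it is ready instead of waiting for the whole batch.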
It is possible to execute async operations in parallel while limiting the number of concurrent operations. There is a handy extension method for that; it is not part of the .NET Framework.
Call it like this:
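The extension method and its call are not reproduced here. As a rough sketch of what such a helper could look like, the block below builds one on SemaphoreSlim. The name ForEachAsyncConcurrent, its signature, and the ThrottleDemo class are assumptions made for illustration, not a framework API and not necessarily the method this answer refers to.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncThrottleExtensions
{
    // Hypothetical helper: runs 'action' for every element, with at most
    // 'maxConcurrency' operations in flight at the same time.
    public static async Task ForEachAsyncConcurrent<T>(
        this IEnumerable<T> source, Func<T, Task> action, int maxConcurrency)
    {
        if (maxConcurrency < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));
        }

        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();
            foreach (var item in source)
            {
                // Wait for a free slot before starting the next operation.
                await throttle.WaitAsync().ConfigureAwait(false);
                tasks.Add(RunOneAsync(item, action, throttle));
            }
            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> action, SemaphoreSlim throttle)
    {
        try
        {
            await action(item).ConfigureAwait(false);
        }
        finally
        {
            // Free the slot even if the operation failed.
            throttle.Release();
        }
    }
}

public static class ThrottleDemo
{
    private static readonly object Gate = new object();
    private static int _running;
    private static int _peak;

    // Example call: returns the highest number of operations seen running at once.
    public static async Task<int> MeasurePeakAsync(int itemCount, int maxConcurrency)
    {
        _running = 0;
        _peak = 0;
        await Enumerable.Range(0, itemCount).ForEachAsyncConcurrent(async _ =>
        {
            lock (Gate)
            {
                _running++;
                if (_running > _peak) _peak = _running;
            }
            await Task.Delay(20).ConfigureAwait(false); // simulated I/O
            lock (Gate)
            {
                _running--;
            }
        }, maxConcurrency).ConfigureAwait(false);
        return _peak;
    }
}
```

Applied to the question, a call might read `await files.ForEachAsyncConcurrent(ProcessFileAsync, 100)`, where ProcessFileAsync is a hypothetical method that awaits the three steps for one file and 100 is an arbitrary limit.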