Recently I started trying to mass-scrape a website for archiving purposes, and I thought it would be a good idea to have multiple web requests running asynchronously to speed things up (10,000,000 pages is definitely a lot to archive). So I ventured into the harsh mistress of parallelism, and three minutes later I started to wonder why the tasks I'm creating (via Task.Factory.StartNew) are 'clogging'.
Annoyed and intrigued, I decided to test this to see whether it was just a result of circumstance, so I created a new console project in VS2012 and wrote this:
static void Main(string[] args)
{
    for (int i = 0; i < 10; i++) {
        int i2 = i + 1;
        Stopwatch t = new Stopwatch();
        t.Start();
        Task.Factory.StartNew(() => {
            t.Stop();
            Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
            Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.Seconds + "." + t.Elapsed.Milliseconds + "s");
            Thread.Sleep(5000);
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine("Task " + i2 + " finished");
        });
    }
    Console.ReadKey();
}
When run, it produced this result:
As you can see, the first four tasks start in quick succession with times of ~0.27. After that, however, each task takes drastically longer to start than the one before it.
Why is this happening and what can I do to fix or get around this limitation?
Tasks run on the thread pool by default, which is just what it sounds like: a pool of threads. The thread pool is optimized for a lot of situations, but throwing Thread.Sleep in there probably throws a wrench into most of them. Also, Task.Factory.StartNew is generally a bad idea to use, because people often don't understand how it works. Try this instead:
static void Main(string[] args)
{
    for (int i = 0; i < 10; i++) {
        int i2 = i + 1;
        Stopwatch t = new Stopwatch();
        t.Start();
        Task.Run(async () => {
            t.Stop();
            Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
            Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.Seconds + "." + t.Elapsed.Milliseconds + "s");
            await Task.Delay(5000);
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine("Task " + i2 + " finished");
        });
    }
    Console.ReadKey();
}
More explanation:
The thread pool has a limited number of threads at its disposal. The exact number changes depending on certain conditions, but in general the limit holds. For this reason, you should never do anything blocking on the thread pool (if you want to achieve parallelism, that is). Thread.Sleep is a perfect example of a blocking API, but so are most web request APIs, unless you use the newer async versions.
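To see the limits the paragraph above refers to on your own machine, a minimal sketch like the following can help. PoolInfo and GetWorkerLimits are names made up for this illustration; ThreadPool.GetMinThreads and ThreadPool.GetMaxThreads are the real APIs. By default the minimum worker count equals the number of processors, and the pool creates threads on demand up to that minimum before it starts adding more only gradually. On a four-core machine that would plausibly explain why exactly four tasks started right away, though that is an assumption about the asker's hardware.

```csharp
using System;
using System.Threading;

static class PoolInfo
{
    // Hypothetical helper: returns (min, max) worker-thread counts of the pool.
    public static Tuple<int, int> GetWorkerLimits()
    {
        int minWorker, minIo, maxWorker, maxIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        return Tuple.Create(minWorker, maxWorker);
    }

    static void Main(string[] args)
    {
        var limits = GetWorkerLimits();
        Console.WriteLine("Min worker threads: " + limits.Item1);
        Console.WriteLine("Max worker threads: " + limits.Item2);
        Console.WriteLine("Processor count:    " + Environment.ProcessorCount);
    }
}
```

Raising the minimum with ThreadPool.SetMinThreads would hide the symptom in the test program, but it does not remove the underlying cost of one blocked thread per request.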
So the problem in your original crawling program is probably the same as in the sample you posted. You are blocking all the thread pool threads, so the pool is forced to spin up new ones, which it does only gradually, and the queued tasks end up waiting ("clogging").
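For the crawler itself, one way to apply this is to keep every request asynchronous and cap how many are in flight. The following is only a sketch under assumptions: Crawler, FetchAllAsync, and the fetch delegate are hypothetical names, and the fake fetch stands in for a real async download. SemaphoreSlim.WaitAsync and Task.WhenAll are the real APIs doing the work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Crawler
{
    // Hypothetical helper: runs 'fetch' for every URL with at most
    // 'maxConcurrency' requests in flight at any one time.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // waits without blocking a thread
                try { return await fetch(url); }
                finally { gate.Release(); }
            }).ToList();                         // ToList starts every task now
            return await Task.WhenAll(tasks);
        }
    }

    static void Main(string[] args)
    {
        // Stand-in for a real async download (assumption for the demo).
        Func<string, Task<string>> fakeFetch = async url =>
        {
            await Task.Delay(100);
            return "body of " + url;
        };
        var urls = Enumerable.Range(1, 20).Select(n => "http://example.com/page" + n);
        string[] pages = FetchAllAsync(urls, fakeFetch, 4).Result;
        Console.WriteLine("Fetched " + pages.Length + " pages");
    }
}
```

In a real crawler the delegate could be something like url => httpClient.GetStringAsync(url). Note that this sketch creates one task per URL up front, so for millions of pages you would likely want to feed it URLs in batches.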
Extra goodies
Coincidentally, using Task.Run in this way also makes it easy to rewrite the code so that you know when it's complete. By storing a reference to each started task and waiting on them all at the end (this does not prevent parallelism), you can reliably know when all the tasks have completed. The following shows how to achieve that:
static void Main(string[] args)
{
    var tasks = new List<Task>();
    for (int i = 0; i < 10; i++) {
        int i2 = i + 1;
        Stopwatch t = new Stopwatch();
        t.Start();
        tasks.Add(Task.Run(async () => {
            t.Stop();
            Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
            Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.Seconds + "." + t.Elapsed.Milliseconds + "s");
            await Task.Delay(5000);
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine("Task " + i2 + " finished");
        }));
    }
    Task.WaitAll(tasks.ToArray());
    Console.WriteLine("All tasks completed");
    Console.ReadKey();
}
Note: this code has not been tested
Read more
More info on Task.Factory.StartNew and why it should be avoided: http://blog.stephencleary.com/2013/08/startnew-is-dangerous.html.
I think this is occurring because you have exhausted all the available threads in the thread pool. Try starting your tasks using TaskCreationOptions.LongRunning. More details here.
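As a rough sketch of that suggestion, assuming the work really has to block: LongRunningDemo and RunBlockingTasks are made-up names, and Thread.Sleep stands in for a synchronous web request. TaskCreationOptions.LongRunning is a hint to the scheduler; with the default scheduler it generally results in a dedicated thread per task instead of a pool thread, so the tasks do not queue behind each other.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class LongRunningDemo
{
    // Hypothetical helper: starts 'count' blocking tasks with the LongRunning
    // hint, waits for all of them, and returns how many ran to completion.
    public static int RunBlockingTasks(int count, int sleepMs)
    {
        int completed = 0;
        var tasks = new Task[count];
        for (int i = 0; i < count; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                Thread.Sleep(sleepMs);                 // stand-in for blocking work
                Interlocked.Increment(ref completed);  // thread-safe counter update
            }, TaskCreationOptions.LongRunning);
        }
        Task.WaitAll(tasks);
        return completed;
    }

    static void Main(string[] args)
    {
        int done = RunBlockingTasks(10, 1000);
        Console.WriteLine(done + " tasks completed");
    }
}
```

This works for a handful of tasks, but a thread per request is unlikely to scale to a crawl of millions of pages, which is why the asynchronous approach is usually the better fit there.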
Another problem is that you are using Thread.Sleep, which blocks the current thread and is a waste of resources. Try waiting asynchronously using await Task.Delay. You will need to change your lambda to be async.
// Note: with an async lambda, StartNew returns a Task<Task>; call .Unwrap()
// on it if you need to wait for the whole body to finish.
Task.Factory.StartNew(async () => {
    t.Stop();
    Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
    Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.Seconds + "." + t.Elapsed.Milliseconds + "s");
    await Task.Delay(5000);
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("Task " + i2 + " finished");
});
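One caveat with the snippet above, sketched here under assumptions (UnwrapDemo and WaitWithUnwrap are names invented for the illustration): when Task.Factory.StartNew is given an async lambda it returns a Task<Task>, and the outer task completes as soon as the lambda reaches its first await. If you want to wait for the whole body, you need the inner task, which Unwrap() exposes. Task.Run does this unwrapping for you.

```csharp
using System;
using System.Threading.Tasks;

static class UnwrapDemo
{
    // Hypothetical helper: returns true if the async body had finished
    // by the time the unwrapped task was waited on.
    public static bool WaitWithUnwrap()
    {
        bool bodyFinished = false;
        Task<Task> outer = Task.Factory.StartNew(async () =>
        {
            await Task.Delay(200);
            bodyFinished = true;
        });
        // 'outer' alone only covers the code up to the first await;
        // Unwrap() yields a task that represents the entire async body.
        outer.Unwrap().Wait();
        return bodyFinished;
    }

    static void Main(string[] args)
    {
        Console.WriteLine("Body finished: " + WaitWithUnwrap());
    }
}
```

Waiting on outer directly instead of outer.Unwrap() would usually return before the delay has elapsed, which is one of the reasons StartNew tends to surprise people when combined with async code.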