Best multi-thread approach for multiple web requests

Asked 2020-05-29 03:47

Question:

I want to create a program that crawls my websites and checks them for HTTP errors and other problems. I want to do this with multiple threads that accept parameters such as the URL to crawl. While only X threads should be active at any one time, Y tasks should already be waiting in a queue to be executed.

Now I want to know which is the best strategy for this: ThreadPool, Tasks, Threads, or something else?

Answer 1:

Here's an example that shows how to queue up a bunch of tasks while limiting the number that run concurrently. It uses a Queue to track tasks that are ready to run and a Dictionary to track tasks that are currently running. When a task finishes, it invokes a callback method to remove itself from the Dictionary. An async method launches queued tasks as space becomes available.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace MinimalTaskDemo
{
    class Program
    {
        private static readonly Queue<Task> WaitingTasks = new Queue<Task>();
        private static readonly Dictionary<int, Task> RunningTasks = new Dictionary<int, Task>();
        public static int MaxRunningTasks = 100; // vary this to dynamically throttle launching new tasks 

        static void Main(string[] args)
        {
            var tokenSource = new CancellationTokenSource();
            var token = tokenSource.Token;
            Worker.Done = new Worker.DoneDelegate(WorkerDone);
            for (int i = 0; i < 1000; i++)  // queue some tasks
            {
                // task state (i) will be our key for RunningTasks
                WaitingTasks.Enqueue(new Task(id => new Worker().DoWork((int)id, token), i, token));
            }
            LaunchTasks();
            Console.ReadKey();
            if (RunningTasks.Count > 0)
            {
                lock (WaitingTasks) WaitingTasks.Clear();
                tokenSource.Cancel();
                Console.ReadKey();
            }
        }

        static async void LaunchTasks()
        {
            // keep checking until we're done
            while ((WaitingTasks.Count > 0) || (RunningTasks.Count > 0))
            {
                // launch tasks when there's room
                while ((WaitingTasks.Count > 0) && (RunningTasks.Count < MaxRunningTasks))
                {
                    Task task;
                    // synchronize with Main, which may Clear() the queue on cancellation
                    lock (WaitingTasks)
                    {
                        if (WaitingTasks.Count == 0) break;
                        task = WaitingTasks.Dequeue();
                    }
                    lock (RunningTasks) RunningTasks.Add((int)task.AsyncState, task);
                    task.Start();
                }
                UpdateConsole();
                await Task.Delay(300); // wait before checking again
            }
            UpdateConsole();    // all done
        }

        static void UpdateConsole()
        {
            Console.Write(string.Format("\rwaiting: {0,3:##0}  running: {1,3:##0} ", WaitingTasks.Count, RunningTasks.Count));
        }

        // callback from finished worker
        static void WorkerDone(int id)
        {
            lock (RunningTasks) RunningTasks.Remove(id);
        }
    }

    internal class Worker
    {
        public delegate void DoneDelegate(int taskId);
        public static DoneDelegate Done { private get; set; }
        private static readonly Random Rnd = new Random();

        public async void DoWork(object id, CancellationToken token)
        {
            for (int i = 0; i < Rnd.Next(20); i++)
            {
                if (token.IsCancellationRequested) break;
                await Task.Delay(100);  // simulate work
            }
            Done((int)id);
        }
    }
}


Answer 2:

I recommend using (asynchronous) Tasks for downloading the data and then processing it (on the thread pool).

Instead of throttling tasks, I recommend you throttle the number of requests per target server. Good news: .NET already does this for you.
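For reference, a minimal sketch of the knobs involved (the limit value of 10 is just an illustration; on .NET Framework the per-server cap comes from ServicePointManager.DefaultConnectionLimit, while on .NET Core / .NET 5+ the equivalent setting is HttpClientHandler.MaxConnectionsPerServer):

using System.Net;
using System.Net.Http;

class ConnectionLimits
{
    static HttpClient Configure()
    {
        // .NET Framework: HttpWebRequest (and HttpClient built on it) caps
        // concurrent connections per server via this static property.
        ServicePointManager.DefaultConnectionLimit = 10;

        // .NET Core / .NET 5+: the equivalent setting lives on the handler.
        var handler = new HttpClientHandler { MaxConnectionsPerServer = 10 };
        return new HttpClient(handler);
    }
}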

This makes your code as simple as:

private static readonly HttpClient client = new HttpClient();

public async Task Crawl(string url)
{
  var html = await client.GetStringAsync(url);             // asynchronous download
  var nextUrls = await Task.Run(() => ProcessHtml(html));  // CPU-bound parsing on the thread pool
  var nextTasks = nextUrls.Select(nextUrl => Crawl(nextUrl));
  await Task.WhenAll(nextTasks);
}

private IEnumerable<string> ProcessHtml(string html)
{
  // return all urls in the html string.
  return Enumerable.Empty<string>(); // placeholder for real parsing
}

which you can kick off with a simple:

await Crawl("http://example.org/");
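Note that the await requires an async context; in a console app you could kick it off from an async Main (C# 7.1 or later). A minimal sketch, assuming the two methods above live in a hypothetical Crawler class:

using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        var crawler = new Crawler(); // hypothetical class holding Crawl/ProcessHtml
        await crawler.Crawl("http://example.org/");
    }
}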


Answer 3:

I would recommend going with the ThreadPool. It is easy enough to work with, and it has a few benefits:

"Thread pool will provide benefits for frequent and relatively short operations by Reusing threads that have already been created instead of creating new ones (an expensive process) Throttling the rate of thread creation when there is a burst of requests for new work items (I believe this is only in .NET 3.5)

If you queue 100 thread pool tasks, it will only use as many threads as have already been created to service these requests (say 10 for example). The thread pool will make frequent checks (I believe every 500ms in 3.5 SP1) and if there are queued tasks, it will make one new thread. If your tasks are quick, then the number of new threads will be small and reusing the 10 or so threads for the short tasks will be faster than creating 100 threads up front.

If your workload consistently has large numbers of thread pool requests coming in, then the thread pool will tune itself to your workload by creating more threads in the pool by the above process so that there are a larger number of thread available to process requests"

Thread vs ThreadPool
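To ground that in the original question, here is a minimal sketch of the ThreadPool approach; the CheckUrl worker and the URL list are hypothetical, and each queued work item receives its URL through the state parameter:

using System;
using System.Net.Http;
using System.Threading;

class ThreadPoolCrawler
{
    private static readonly HttpClient Client = new HttpClient();
    private static readonly CountdownEvent Pending = new CountdownEvent(1);

    static void Main()
    {
        string[] urls = { "http://example.org/", "http://example.com/" };
        foreach (var url in urls)
        {
            Pending.AddCount();
            // QueueUserWorkItem reuses pooled threads instead of creating one per URL.
            ThreadPool.QueueUserWorkItem(CheckUrl, url);
        }
        Pending.Signal(); // remove the initial count so Wait can complete
        Pending.Wait();   // block until every queued check has signaled
    }

    // hypothetical worker: reports the HTTP status (or error) for one URL
    static void CheckUrl(object state)
    {
        var url = (string)state;
        try
        {
            var response = Client.GetAsync(url).Result; // blocking is acceptable on a pool thread for a demo
            Console.WriteLine("{0}: {1}", url, (int)response.StatusCode);
        }
        catch (Exception ex)
        {
            Console.WriteLine("{0}: {1}", url, ex.Message);
        }
        finally
        {
            Pending.Signal();
        }
    }
}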



Answer 4:

Well, Task is a good way to go, because it means you don't have to write a lot of the "plumbing" code yourself.
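As a minimal sketch of what that looks like (the CheckSite worker is hypothetical), Task.WhenAll replaces the thread creation and joining you would otherwise manage by hand:

using System;
using System.Linq;
using System.Threading.Tasks;

class TaskDemo
{
    static async Task Main()
    {
        var urls = new[] { "http://example.org/", "http://example.com/" };
        // One Task per URL; WhenAll handles the joining "plumbing".
        var tasks = urls.Select(url => Task.Run(() => CheckSite(url)));
        await Task.WhenAll(tasks);
    }

    // hypothetical worker standing in for a real HTTP check
    static void CheckSite(string url) => Console.WriteLine("checked " + url);
}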

I would also recommend that you check out Joe Albahari's web site on threading; it's quite a good primer:

http://www.albahari.com/threading/