Parallelizing a task using .AsParallel().ForAll or

Posted 2019-04-14 06:33

I have a list of websites and a list of proxy servers.

I have this action

Action<string> action = (string url) =>
{
    var proxy = ProxyHandler.GetProxy();
    HtmlDocument html = null;
    while (html == null)
    {
        try
        {

            html = htmlDocumentLoader.LoadDocument(url, proxy.Address);

            // Various db manipulation code

            ProxyHandler.ReleaseProxy(proxy);
        }
        catch (Exception exc)
        {
            Console.WriteLine("{0} proxies remain", ProxyHandler.ListSize());

            // Various db manipulation code

            proxy = ProxyHandler.GetProxy();
        }
    }
};

Which I call using

urlList.AsParallel().WithDegreeOfParallelism(12).ForAll(action);

or

Parallel.ForEach(urlList, action);

My ProxyHandler class is as follows

public static class ProxyHandler
{    
    static List<Proxy> ProxyList = new ProxyRepository().GetAliveProxies().ToList();

    public static Proxy GetProxy()
    {
        lock (ProxyList)
        {
            while (ProxyList.Count == 0)
            {
                Console.WriteLine("Sleeping");
                Thread.Sleep(1000);
            }
            var proxy = ProxyList[0];
            ProxyList.RemoveAt(0);
            return proxy;
        }           
    }

    public static void ReleaseProxy(Proxy proxy)
    {
        lock (ProxyList)
        {
            if(!ProxyList.Contains(proxy))ProxyList.Add(proxy);
        }
    }

    public static int ListSize()
    {
        lock (ProxyList)
        {
            return ProxyList.Count;
        }
    }
}

My problem is that when this executes, it appears to complete ~90% of the websites very quickly and then takes a really long time to do the remainder.

What I mean is that out of 100 urls, it takes as much time to do the first 90 as it does to do the last 10.

I have ruled out proxies being dead since no exception is thrown. It appears as if the last of the items on the urlList just take really long to complete.

UPDATE:

I am adding some running data to make my problem clearer:

Minute   1   2   3   4   5   6   7   8   9  16  18  19
Count   23  32  32  17   6   1   1   1   1   2   1   2

As you can see in the first 4 minutes I do 104/119 of the requests. And then it takes 15 minutes to do the rest.

This looks like a problem in the joining of the Threads, but I fail to spot what this might be.

2 answers
beautiful°
#2 · 2019-04-14 06:57

You are wasting threads and CPU time. In this case you have 12 threads, and each thread processes only one url at a time, so you process at most 12 urls at a time. Moreover, most of the time these threads do nothing (they just wait for a free proxy or for a page to load) while they could be used for more useful tasks.

To avoid this, you should use non-blocking I/O. Instead of htmlDocumentLoader.LoadDocument, consider using one of its asynchronous interfaces (htmlDocumentLoader.BeginLoadDocument / htmlDocumentLoader.EndLoadDocument, or htmlDocumentLoader.LoadDocumentAsync / htmlDocumentLoader.LoadDocumentCompleted). In that case, if you have 100 urls, all of them can be loaded simultaneously without creating extra threads or wasting CPU time. Only when a page has loaded is a thread used (actually taken from the ThreadPool) to handle it.
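A minimal sketch of the non-blocking approach, assuming C# 5 async/await is available. The names AsyncCrawler, LoadAllAsync and the loadAsync delegate are hypothetical; loadAsync stands in for whatever asynchronous call your loader actually offers (the question does not show one). A SemaphoreSlim caps how many loads are in flight without blocking any thread while waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncCrawler
{
    // loadAsync is a hypothetical stand-in for an asynchronous page load.
    public static async Task<IReadOnlyList<string>> LoadAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> loadAsync,
        int maxConcurrency)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            // ToList() starts every task up front; the semaphore then
            // lets at most maxConcurrency of them run their load at once.
            var tasks = urls.Select(async url =>
            {
                await throttle.WaitAsync();
                try
                {
                    return await loadAsync(url);
                }
                finally
                {
                    throttle.Release();
                }
            }).ToList();

            // Results come back in the same order as the input urls.
            return await Task.WhenAll(tasks);
        }
    }
}
```

Retry and proxy selection would go inside the lambda, around the loadAsync call.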

The way you wait for a free proxy is wasteful too. Instead of using while (ProxyList.Count == 0), which freezes the thread when no proxy is free, consider using a timer that wakes up every second and checks whether a proxy is available. It is not the best solution, but at least it does not waste threads. A better solution is to add an event to ProxyHandler that notifies waiters when a proxy becomes available.
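One way to get that kind of notification without writing the event plumbing by hand is BlockingCollection<T>, sketched below. ProxyPool is a hypothetical replacement for ProxyHandler, made generic so the sketch does not depend on the Proxy class. Note also that the original GetProxy calls Thread.Sleep while holding the lock, so ReleaseProxy cannot add a proxy during the wait; Take() does not have that problem.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical replacement for ProxyHandler (name and shape are assumptions).
public sealed class ProxyPool<TProxy>
{
    private readonly BlockingCollection<TProxy> _free;

    public ProxyPool(IEnumerable<TProxy> proxies)
    {
        // A ConcurrentQueue keeps the first-in, first-out order of the original list.
        _free = new BlockingCollection<TProxy>(new ConcurrentQueue<TProxy>(proxies));
    }

    // Blocks until a proxy is free; woken as soon as ReleaseProxy adds one.
    public TProxy GetProxy()
    {
        return _free.Take();
    }

    // Unlike the original, this does not check for duplicates,
    // so each proxy should be released at most once per GetProxy call.
    public void ReleaseProxy(TProxy proxy)
    {
        _free.Add(proxy);
    }

    public int ListSize()
    {
        return _free.Count;
    }
}
```

This still blocks the calling thread while the pool is empty, but it does so without polling and without holding a lock that the releasing thread needs.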

Animai°情兽
#3 · 2019-04-14 07:06

Your problem is probably due to the Partitioner being used by PLINQ.

If the Range Partitioner is being used, your collection of urls is split into groups with an equal(ish) number of urls in each. Then a task is started for each group with no further synchronization.

This means that there will be one task that takes longest and still has work to do when all the other tasks have finished. This effectively means that the last part of the operation is single-threaded.

The solution is to use a different Partitioner. You may be able to use the built-in Chunk Partitioner as explained on MSDN.

If that doesn't work well enough, you will have to write / find a partitioner implementation that yields elements one-by-one. This is built in to .NET 4.5: EnumerablePartitionerOptions
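A minimal sketch of the one-by-one partitioner, assuming .NET 4.5 or later. PartitionedRunner and its method names are hypothetical; the Partitioner.Create overload and EnumerablePartitionerOptions.NoBuffering are the built-in pieces. With NoBuffering each worker takes a single url at a time, so no worker is left holding a batch of slow urls while the others sit idle.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PartitionedRunner
{
    // PLINQ variant of: urlList.AsParallel().WithDegreeOfParallelism(12).ForAll(action)
    public static void ForAllOneByOne(IEnumerable<string> urls, Action<string> action, int degree)
    {
        var partitioner = Partitioner.Create(urls, EnumerablePartitionerOptions.NoBuffering);
        partitioner.AsParallel().WithDegreeOfParallelism(degree).ForAll(action);
    }

    // Parallel.ForEach variant with the same partitioner.
    public static void ForEachOneByOne(IEnumerable<string> urls, Action<string> action, int degree)
    {
        var partitioner = Partitioner.Create(urls, EnumerablePartitionerOptions.NoBuffering);
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        Parallel.ForEach(partitioner, options, action);
    }
}
```

Handing out items one at a time adds some synchronization per item, which is negligible here because each item is a network request.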
