I have a list of websites and a list of proxy servers.
I process each URL with this action:
    Action<string> action = (string url) =>
    {
        var proxy = ProxyHandler.GetProxy();
        HtmlDocument html = null;
        while (html == null)
        {
            try
            {
                html = htmlDocumentLoader.LoadDocument(url, proxy.Address);
                // Various db manipulation code
                ProxyHandler.ReleaseProxy(proxy);
            }
            catch (Exception exc)
            {
                Console.WriteLine("{0} proxies remain", ProxyHandler.ListSize());
                // Various db manipulation code
                proxy = ProxyHandler.GetProxy();
            }
        }
    };
I call it using either

    urlList.AsParallel().WithDegreeOfParallelism(12).ForAll(action);

or

    Parallel.ForEach(urlList, action);

My ProxyHandler class is as follows:
    public static class ProxyHandler
    {
        static List<Proxy> ProxyList = new ProxyRepository().GetAliveProxies().ToList();

        public static Proxy GetProxy()
        {
            lock (ProxyList)
            {
                while (ProxyList.Count == 0)
                {
                    Console.WriteLine("Sleeping");
                    Thread.Sleep(1000);
                }
                var proxy = ProxyList[0];
                ProxyList.RemoveAt(0);
                return proxy;
            }
        }

        public static void ReleaseProxy(Proxy proxy)
        {
            lock (ProxyList)
            {
                if (!ProxyList.Contains(proxy)) ProxyList.Add(proxy);
            }
        }

        public static int ListSize()
        {
            lock (ProxyList)
            {
                return ProxyList.Count;
            }
        }
    }
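For comparison, here is a minimal sketch of how the same pool could be built on BlockingCollection<T>, so that a thread waiting for a proxy blocks inside Take() rather than sleeping while it holds the lock. This is not the code I am running. The Proxy class below is a hypothetical stand-in for my real type, BlockingProxyPool is a made-up name, and unlike ReleaseProxy above it does not check for duplicates before adding.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical stand-in for the real Proxy type, which is not shown in this post.
public sealed class Proxy
{
    public string Address { get; }
    public Proxy(string address) { Address = address; }
}

// Sketch of a proxy pool that blocks in Take() instead of sleeping inside a lock.
public sealed class BlockingProxyPool
{
    private readonly BlockingCollection<Proxy> _pool;

    public BlockingProxyPool(IEnumerable<Proxy> proxies)
    {
        // Backed by a ConcurrentQueue, so proxies are handed out in FIFO order.
        _pool = new BlockingCollection<Proxy>(new ConcurrentQueue<Proxy>(proxies));
    }

    // Blocks until a proxy is available. No lock is held while waiting,
    // so another thread can call Release in the meantime.
    public Proxy Get() => _pool.Take();

    // Returns a proxy to the pool (no duplicate check, unlike ReleaseProxy).
    public void Release(Proxy proxy) => _pool.Add(proxy);

    // Number of proxies currently available.
    public int Count => _pool.Count;
}
```

With this shape, the action would call pool.Get() where it currently calls ProxyHandler.GetProxy(), and pool.Release(proxy) after a successful load. I have not verified whether this changes the slow tail described below.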
My problem is that when this executes, it appears to complete about 90% of the websites very quickly and then takes a very long time to do the remainder.
In other words, out of 100 URLs, it takes as much time to do the first 90 as it does to do the last 10.
I have ruled out dead proxies, since no exception is thrown. It appears that the last items in urlList simply take a very long time to complete.
UPDATE:
I am adding some running data to make my problem clearer:
    Minute   1   2   3   4   5   6   7   8   9  16  18  19
    Count   23  32  32  17   6   1   1   1   1   2   1   2
As you can see, in the first 4 minutes I complete 104 of the 119 requests, and then it takes another 15 minutes to do the rest.
This looks like a problem in the joining of the threads, but I fail to spot what it might be.