I am writing a multithreaded scraper in Java using HtmlUnit's WebClient. I use a pool of proxies, managed by a simple class that holds the list of proxies; calling its GetProxy function returns the IP and port of the next proxy in the list. I have tested this class thoroughly and confirmed that it works as intended with any number of threads.
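For context, a pool of this kind can be sketched roughly as follows. This is only an illustration, not the actual class: `ProxyPool`, its constructor, and the lower-case `getProxy` name are assumptions, and `ProxyData` is a hypothetical stand-in that only mirrors the `_host` and `_port` fields used by getHTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the ProxyData type passed to getHTML.
class ProxyData {
    final String _host;
    final int _port;

    ProxyData(String host, int port) {
        _host = host;
        _port = port;
    }
}

// Hypothetical round-robin pool that can be shared between threads.
class ProxyPool {
    private final List<ProxyData> proxies;
    private final AtomicInteger next = new AtomicInteger(0);

    ProxyPool(List<ProxyData> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        // Defensive copy so later changes to the caller's list cannot affect the pool.
        this.proxies = new ArrayList<ProxyData>(proxies);
    }

    // Returns the next proxy in the list, wrapping around at the end.
    ProxyData getProxy() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

The atomic counter means no two threads can be handed the same slot for the same call, which matches the behaviour I see when testing the pool on its own.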
From there I have a getHTML function where I can pass in a URL and a proxy and it will return the page for me:
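    import java.io.IOException;
    import java.net.MalformedURLException;

    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.ProxyConfig;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public String getHTML(String URL, ProxyData pData)
    {
        // A fresh WebClient is created for every request.
        WebClient webClient = new WebClient();
        String pageAsXml = "";
        webClient.setJavaScriptEnabled(false);

        // Route this client through the proxy that was passed in.
        ProxyConfig pConf = new ProxyConfig(pData._host, pData._port);
        webClient.setProxyConfig(pConf);
        try
        {
            HtmlPage page = webClient.getPage(URL);
            pageAsXml = page.asXml();
        }
        catch (FailingHttpStatusCodeException e)
        {
            e.printStackTrace();
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            // Close the client even if an unchecked exception is thrown.
            webClient.closeAllWindows();
        }
        return pageAsXml;
    }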
If I print the WebClient's proxy settings to the console after setting them, they show the correct IP, and stepping through the code in the debugger confirms this. However, the HTML that comes back doesn't reflect the new proxy.
To check whether the proxies are working, I request WhatIsMyIP's automation page (http://automation.whatismyip.com/n09230945.asp). After each page load I write three values to the console: the proxy I passed into the function, the proxy the WebClient reported at the time of the page load, and the IP returned in the HTML. The first two always match, but the returned IP is often different. On the first pass every returned IP is correct, but after that the proxies seem to get reused, and not necessarily within the same thread: the returned IP appears to be a random proxy that was already in use elsewhere.
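The comparison I log is roughly equivalent to the sketch below. `ProxyCheck` and its method names are hypothetical, and it assumes the returned IP has already been extracted from the response as plain text (only surrounding whitespace is stripped here).

```java
// Hypothetical helper that compares the three values written to the console.
final class ProxyCheck {
    private ProxyCheck() {
    }

    // True only if the passed proxy, the client's configured proxy,
    // and the IP reported by the remote page are all the same host.
    static boolean allMatch(String passedHost, String clientHost, String returnedIp) {
        String returned = returnedIp == null ? "" : returnedIp.trim();
        return passedHost.equals(clientHost) && passedHost.equals(returned);
    }

    // Builds one console line per page load so mismatches are easy to spot.
    static String logLine(String passedHost, String clientHost, String returnedIp) {
        String returned = returnedIp == null ? "" : returnedIp.trim();
        String verdict = allMatch(passedHost, clientHost, returnedIp) ? "OK" : "MISMATCH";
        return String.format("passed=%s client=%s returned=%s %s",
                passedHost, clientHost, returned, verdict);
    }
}
```

In my runs the `passed` and `client` values always agree; it is the `returned` value that drifts.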
In other words, old proxies seem to be reused at random for a while, even across threads, before the new one actually takes effect. Even though I set a new proxy, and the WebClient reports that the new proxy is set, the request still appears to go through an old one.
So what is causing this, and how do I get around it?