Multithreading with Java's htmlunit.WebClient

Posted 2019-07-30 18:52

Question:

I am writing a multithreaded scraper using Java and htmlunit's WebClient. I am using a pool of proxies, and I have a simple class to handle them. It has the list of proxies, and you call a GetProxy function to get the IP and port of the next proxy in the list. I have tested it thoroughly, and I have confirmed that it is working as intended with any number of threads.
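For readers following along, here is a minimal sketch of what such a proxy holder might look like, assuming a round-robin rotation backed by an atomic counter. The question does not show its own class, so `ProxyPool` and `nextProxy` are hypothetical names, and `ProxyData` is a stand-in that only mirrors the `_host` and `_port` fields used in the question's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the question's ProxyData (only _host and _port are known).
final class ProxyData {
    final String _host;
    final int _port;

    ProxyData(String host, int port) {
        _host = host;
        _port = port;
    }
}

// Hypothetical round-robin pool; safe to call from any number of threads.
final class ProxyPool {
    private final List<ProxyData> proxies;
    private final AtomicInteger next = new AtomicInteger(0);

    ProxyPool(List<ProxyData> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        // Defensive copy so later changes to the caller's list cannot race with readers.
        this.proxies = Collections.unmodifiableList(new ArrayList<>(proxies));
    }

    // Returns the next proxy in the list, wrapping around at the end.
    ProxyData nextProxy() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Because the only shared mutable state is the counter, no locking is needed; each caller gets a distinct index even under contention.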

From there I have a getHTML function where I can pass in a URL and a proxy and it will return the page for me:

public String getHTML(String URL, ProxyData pData)
{
    WebClient webClient = new WebClient();
    String pageAsXml = "";

    webClient.setJavaScriptEnabled(false);

    ProxyConfig pConf = new ProxyConfig(pData._host, pData._port);
    webClient.setProxyConfig(pConf);

    try
    {
        HtmlPage page = webClient.getPage(URL);
        pageAsXml = page.asXml();
    }
    catch (FailingHttpStatusCodeException e)
    {
        e.printStackTrace();
    }
    catch (MalformedURLException e)
    {
        e.printStackTrace();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    finally
    {
        // Release the client's windows even if an unexpected exception is thrown.
        webClient.closeAllWindows();
    }

    return pageAsXml;
}

If I write the WebClient's proxy settings to the console after setting them in the code, they show the correct IP. Stepping through in debug mode confirms this. However, the HTML that comes back doesn't reflect the changed proxy.

I am using WhatIsMyIP's automation page to check whether my proxies are working (http://automation.whatismyip.com/n09230945.asp). After every page load I write three values to the console: the proxy I passed into the function, the proxy the WebClient reported using at the time of the page load, and the IP returned in the HTML. The first two always match, but the returned IP is off. Every IP is correct the first time, but after that the proxies seem to be reused, and not always within the same thread. A request seems to go out through a random proxy that has already been used.
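The three-way comparison described above could be logged with a small helper like the following sketch. `ProxyCheck` and its methods are hypothetical names, and the code assumes the returned IP has already been extracted from the response as a plain string. Including the thread name in each line makes cross-thread reuse easier to spot.

```java
// Hypothetical diagnostic helper; not part of htmlunit.
final class ProxyCheck {
    // True when the IP echoed by the checker page equals the proxy host that was passed in.
    static boolean matches(String passedHost, String returnedIp) {
        return passedHost.trim().equals(returnedIp.trim());
    }

    // Builds one log line with the current thread name and all three values.
    static String describe(String passedHost, String clientHost, String returnedIp) {
        String verdict = matches(passedHost, returnedIp) ? "OK" : "MISMATCH";
        return String.format("[%s] passed=%s client=%s returned=%s %s",
                Thread.currentThread().getName(),
                passedHost, clientHost, returnedIp.trim(), verdict);
    }
}
```

A call such as `System.out.println(ProxyCheck.describe(pData._host, clientHost, returnedIp))` after each page load would flag exactly the requests where the exit IP differs from the configured proxy.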

It seems like the proxies get reused randomly for a while before they are actually replaced, even across threads. Even though I set a new proxy, and the WebClient seems to know that I have set a new proxy, it still seems to use an old one.

So what is causing this, and how do I get around it?

Answer 1:

Here is a framework that essentially does the multithreading over a pool of proxies via htmlunit for you: https://github.com/subes/invesdwin-webproxy

It also solves other problems, such as too many instances of htmlunit's JavaScript parser exhausting the CPU. Maybe the code can give you a hint about what to do differently when using htmlunit in that manner in your own framework.
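As a rough illustration of keeping the number of concurrent clients bounded, here is a sketch using a fixed-size thread pool from the standard library. It is not taken from invesdwin-webproxy; `BoundedScraper` and `fetchAll` are hypothetical names, and the `fetch` function stands in for whatever actually loads a page (for example, a wrapper around the question's `getHTML`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper that caps how many fetches run at the same time.
final class BoundedScraper {
    // Applies fetch to every URL using at most maxThreads worker threads
    // and returns the results in the same order as the input list.
    static List<String> fetchAll(List<String> urls, int maxThreads,
                                 Function<String, String> fetch) {
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                futures.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that task finishes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a fetch", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed", e.getCause());
                }
            }
            return results;
        } finally {
            // Stop accepting new tasks; already submitted work has finished by now
            // unless an exception cut the loop short.
            executor.shutdown();
        }
    }
}
```

With this shape, the pool size rather than the number of URLs decides how many clients exist at once, which is one way to avoid overloading the CPU.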