HtmlUnit WebClient Timeout

Posted 2019-04-11 19:31

In my previous questions about HtmlUnit, Skip particular Javascript execution in HTML unit and Fetch Page source using HtmlUnit : URL got stuck, I had mentioned that a URL is getting stuck. I have also found out that it gets stuck because one of the methods (parse) in the HtmlUnit library never comes out of execution.

I did further work on this and wrote code to break out of the method if it takes more than a specified number of seconds to complete.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HandleHtmlUnitTimeout {

    public static void main(String[] args) throws FailingHttpStatusCodeException,
            MalformedURLException, IOException, InterruptedException, TimeoutException {
        Date start = new Date();
        String url = "http://ericaweiner.com/collections/";
        doWorkWithTimeout(url, 60);
    }

    public static void doWorkWithTimeout(final String url, long timeoutSecs)
            throws InterruptedException, TimeoutException {
        // maintains a thread for executing the getPageSource method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        //logger.info("Starting method with " + timeoutSecs + " seconds as timeout");

        // set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run() {
                try {
                    getPageSource(url);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });

        // check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            // ExecutionException: the task threw an exception
            // TimeoutException: the task did not complete within timeoutSecs
            // InterruptedException: the waiting thread was interrupted

            // interrupts the worker thread if necessary
            future.cancel(true);

            //logger.warn("encountered problem while doing some work", e);
            throw new TimeoutException();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void getPageSource(String productPageUrl) {
        try {
            if (productPageUrl == null) {
                productPageUrl = "http://ericaweiner.com/collections/";
            }

            WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
            wb.getOptions().setTimeout(120000);
            wb.getOptions().setJavaScriptEnabled(true);
            wb.getOptions().setThrowExceptionOnScriptError(true);
            wb.getOptions().setThrowExceptionOnFailingStatusCode(false);
            HtmlPage page = wb.getPage(productPageUrl);
            wb.waitForBackgroundJavaScript(4000);
            wb.closeAllWindows();
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

This code does come out of the doWorkWithTimeout(url, 60) call, but the program itself does not terminate.
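
If the hang itself cannot be avoided, one possible workaround, sketched below with illustrative class and thread names, is to build the executor from a daemon ThreadFactory: a hung non-daemon pool thread keeps the JVM alive even after shutdownNow(), whereas a daemon thread does not prevent the JVM from exiting.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class DaemonWorkerSketch {

    public static void main(String[] args) throws InterruptedException {
        // Single-thread pool whose thread is a daemon: if the task hangs forever,
        // it no longer prevents the JVM from exiting once main() returns.
        ExecutorService executor = Executors.newFixedThreadPool(1, new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "scraper-worker");
                t.setDaemon(true);
                return t;
            }
        });

        executor.submit(new Runnable() {
            public void run() {
                // stands in for the hanging getPageSource(url) call;
                // this busy loop ignores interruption on purpose
                while (true) { }
            }
        });

        Thread.sleep(2000);      // give the task time to start (and hang)
        executor.shutdownNow();  // the interrupt is ignored by the busy loop
        System.out.println("main() returns and the JVM exits despite the hung daemon worker");
    }
}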

When I try to call a similar implementation with the following code:

import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.log4j.Logger;


public class HandleScraperTimeOut {

    private static Logger logger = Logger.getLogger(HandleScraperTimeOut.class);

    public void doWork() throws InterruptedException {
        logger.info(new Date() + " Starting worker method");
        Thread.sleep(20000);
        logger.info(new Date() + " Ending worker method");
        //perform some long running task here...
    }

    public void doWorkWithTimeout(int timeoutSecs) {
        // maintains a thread for executing the doWork method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        logger.info("Starting method with " + timeoutSecs + " seconds as timeout");

        // set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run() {
                try {
                    doWork();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });

        // check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            // ExecutionException: the task threw an exception
            // TimeoutException: the task did not complete within timeoutSecs
            // InterruptedException: the waiting thread was interrupted

            // interrupts the worker thread if necessary
            future.cancel(true);

            logger.warn("encountered problem while doing some work", e);
        }
        executor.shutdown();
    }

    public static void main(String[] args) {
        HandleScraperTimeOut hcto = new HandleScraperTimeOut();
        hcto.doWorkWithTimeout(30);
    }

}

If anybody can have a look and tell me what the issue is, it would be really helpful.

For more details about the issue, you can look into Skip particular Javascript execution in HTML unit and Fetch Page source using HtmlUnit : URL got stuck.


Update 1: The strange thing is that future.cancel(true); returns TRUE in both cases. What I expected was:

  • With HtmlUnit it should return FALSE, since the process is still hanging.
  • With the plain Thread.sleep() it should return TRUE, since the process was cancelled successfully.

Update 2: It only hangs with the http://ericaweiner.com/collections/ URL. If I give any other URL, e.g. http://www.google.com or http://www.yahoo.com, it does not hang. In those cases it throws an InterruptedException and comes out of the process.

It seems that the http://ericaweiner.com/collections/ page source has certain elements which are causing problems.
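
One way to test that hypothesis, sketched below with the same WebClient options as above, is to fetch the page with JavaScript disabled and see whether the hang disappears. The commented-out setJavaScriptTimeout call is only an assumption about the HtmlUnit version in use, so verify it exists before relying on it.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchWithoutScripts {

    public static void main(String[] args) throws Exception {
        WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
        wb.getOptions().setTimeout(120000);
        // Disable JavaScript entirely: if the page now loads, the hang is caused
        // by a script on the page rather than by the HTTP fetch itself.
        wb.getOptions().setJavaScriptEnabled(false);
        wb.getOptions().setThrowExceptionOnFailingStatusCode(false);

        // Alternatively, keep JavaScript enabled but cap how long a single script
        // may run (available on WebClient in some 2.x versions; check your version):
        // wb.setJavaScriptTimeout(10000);

        HtmlPage page = wb.getPage("http://ericaweiner.com/collections/");
        System.out.println(page.getTitleText());
        wb.closeAllWindows();
    }
}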

1 Answer

时光不老，我们不散 · 2019-04-11 20:32

Future.cancel(boolean) returns:

  • false if the task could not be cancelled, typically because it has already completed normally
  • true otherwise

Cancelled means the task did not finish before cancel was called, the cancelled flag was set to true, and, if requested, the thread was interrupted.

Interrupting the thread means it called Thread.interrupt() and nothing more. Future.cancel(boolean) does not check whether the thread actually stopped.

So it is correct that cancel returns true in both cases.
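
A minimal sketch of that behaviour (class name and timings are illustrative): the task below deliberately ignores interruption, yet cancel(true) still returns true and isCancelled() reports true while the worker thread keeps spinning.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancelReturnValueDemo {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(1);

        Future<?> future = executor.submit(new Runnable() {
            public void run() {
                // Ignores interruption on purpose: never checks Thread.interrupted()
                // and never calls a blocking method that throws InterruptedException.
                while (true) {
                    Math.sqrt(Math.random()); // busy work
                }
            }
        });

        Thread.sleep(1000);                      // let the task start
        boolean cancelled = future.cancel(true); // only sets the flag and interrupts

        System.out.println("cancel() returned: " + cancelled);            // true
        System.out.println("isCancelled():     " + future.isCancelled()); // true
        System.out.println("...but the worker thread is still spinning");

        executor.shutdownNow();
        // The spinning non-daemon worker would keep the JVM alive, so exit explicitly.
        System.exit(0);
    }
}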

Interrupting a thread means it should stop as soon as possible, but that is not enforced. You can try to make it stop or fail by closing a resource it needs. I usually do that with a thread that is reading from a socket (waiting for incoming data): I close the socket so it stops waiting.
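
A minimal sketch of that socket trick (class names, port handling and timings are illustrative): the reader thread blocks in read(), which does not react to Thread.interrupt(), but closing the socket from another thread makes the read fail with an IOException, giving the thread a chance to exit.

import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseSocketToUnblock {

    public static void main(String[] args) throws Exception {
        // A local server that accepts a connection but never sends anything,
        // so the client read below blocks indefinitely.
        final ServerSocket server = new ServerSocket(0);
        Thread serverThread = new Thread(new Runnable() {
            public void run() {
                try {
                    server.accept();              // keep the connection open, send nothing
                    Thread.sleep(Long.MAX_VALUE);
                } catch (Exception ignored) {
                }
            }
        });
        serverThread.setDaemon(true);             // do not keep the JVM alive afterwards
        serverThread.start();

        final Socket client = new Socket("localhost", server.getLocalPort());

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    InputStream in = client.getInputStream();
                    System.out.println("reader: waiting for data...");
                    in.read();                    // blocks; Thread.interrupt() will NOT unblock this
                    System.out.println("reader: got data");
                } catch (IOException e) {
                    System.out.println("reader: socket closed, exiting (" + e + ")");
                }
            }
        });
        reader.start();

        Thread.sleep(1000);
        client.close();                           // forces the blocked read() to fail with an IOException
        reader.join();
        server.close();
    }
}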
