I am using this method to instantiate a web browser programmatically, navigate to a URL and return a result when the document has completed.
How would I be able to stop the Task and have GetFinalUrl() return null if the document takes more than 5 seconds to load?
I have seen many examples using a TaskFactory, but I haven't been able to apply it to this code.
    private Uri GetFinalUrl(PortalMerchant portalMerchant)
    {
        SetBrowserFeatureControl();
        Uri finalUri = null;
        if (string.IsNullOrEmpty(portalMerchant.Url))
        {
            return null;
        }
        Uri trackingUrl = new Uri(portalMerchant.Url);
        var task = MessageLoopWorker.Run(DoWorkAsync, trackingUrl);
        task.Wait();
        if (!String.IsNullOrEmpty(task.Result.ToString()))
        {
            return new Uri(task.Result.ToString());
        }
        else
        {
            throw new Exception("Parsing Failed");
        }
    }
    // by Noseratio - http://stackoverflow.com/users/1768303/noseratio
    static async Task<object> DoWorkAsync(object[] args)
    {
        _threadCount++;
        Console.WriteLine("Thread count:" + _threadCount);
        Uri retVal = null;
        var wb = new WebBrowser();
        wb.ScriptErrorsSuppressed = true;
        TaskCompletionSource<bool> tcs = null;
        WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) => tcs.TrySetResult(true);
        foreach (var url in args)
        {
            tcs = new TaskCompletionSource<bool>();
            wb.DocumentCompleted += documentCompletedHandler;
            try
            {
                wb.Navigate(url.ToString());
                await tcs.Task;
            }
            finally
            {
                wb.DocumentCompleted -= documentCompletedHandler;
            }
            retVal = wb.Url;
            wb.Dispose();
            return retVal;
        }
        return null;
    }
    public static class MessageLoopWorker
    {
        #region Public static methods
        public static async Task<object> Run(Func<object[], Task<object>> worker, params object[] args)
        {
            var tcs = new TaskCompletionSource<object>();
            var thread = new Thread(() =>
            {
                EventHandler idleHandler = null;
                idleHandler = async (s, e) =>
                {
                    // handle Application.Idle just once
                    Application.Idle -= idleHandler;
                    // return to the message loop
                    await Task.Yield();
                    // and continue asynchronously
                    // propagate the result or exception
                    try
                    {
                        var result = await worker(args);
                        tcs.SetResult(result);
                    }
                    catch (Exception ex)
                    {
                        tcs.SetException(ex);
                    }
                    // signal to exit the message loop
                    // Application.Run will exit at this point
                    Application.ExitThread();
                };
                // handle Application.Idle just once
                // to make sure we're inside the message loop
                // and SynchronizationContext has been correctly installed
                Application.Idle += idleHandler;
                Application.Run();
            });
            // set STA model for the new thread
            thread.SetApartmentState(ApartmentState.STA);
            // start the thread and await the task
            thread.Start();
            try
            {
                return await tcs.Task;
            }
            finally
            {
                thread.Join();
            }
        }
        #endregion
    }
I'm trying to benefit from Noseratio's solution while also following Stephen Cleary's advice.
Here is my updated code, which adds Noseratio's AJAX tip to Stephen's code.
First part: the Task NavigateAsync advised by Stephen.
Second part: a new Task NavAjaxAsync to run the AJAX tip (based on Noseratio's code).
Third part: a new Task NavAndAjaxAsync to combine the navigation and the AJAX check.
Fourth and last part: the updated Task GetUrlAsync from Stephen, with Noseratio's code for AJAX.
I'd like to know if this is the right approach.
I suspect running a processing loop on another thread will not work out well, since WebBrowser is a UI component that hosts an ActiveX control.
When you're writing TAP over EAP wrappers, I recommend using extension methods to keep the code clean:
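A minimal sketch of such an extension method, assuming the WinForms WebBrowser control and its DocumentCompleted event. The class name WebBrowserExtensions is illustrative, and the first DocumentCompleted event (which may come from a frame) is treated as completion:

```csharp
using System.Threading.Tasks;
using System.Windows.Forms;

public static class WebBrowserExtensions
{
    // Wraps the event-based Navigate/DocumentCompleted pair in a Task.
    // Assumption: called on the WebBrowser's own STA thread, with a message loop running.
    public static Task<string> NavigateAsync(this WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // Unsubscribe first so the handler runs at most once per call.
            browser.DocumentCompleted -= handler;
            tcs.TrySetResult(e.Url.ToString());
        };
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

The caller can then write `await wb.NavigateAsync(url)` instead of wiring and unwiring the event handler by hand.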
Now your code can easily apply a timeout:
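One way to sketch this, assuming the navigation is already exposed as a Task<string>: race it against Task.Delay with Task.WhenAny and return null if the delay wins. The helper name OrNullAfterAsync is hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskTimeoutExtensions
{
    // Hypothetical helper: yields the task's result, or null if `timeout` elapses first.
    public static async Task<string> OrNullAfterAsync(this Task<string> task, TimeSpan timeout)
    {
        var delay = Task.Delay(timeout);
        var completed = await Task.WhenAny(task, delay);
        if (completed == task)
            return await task; // propagates the result or the exception
        return null;           // timed out; the original task is simply abandoned
    }
}
```

With a NavigateAsync wrapper this would read `await wb.NavigateAsync(url).OrNullAfterAsync(TimeSpan.FromSeconds(5))`. Note that the timeout does not stop the navigation itself, so the WebBrowser should still be disposed afterwards, for example with a using block.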
which can be consumed as such:
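A rough sketch of the consuming side, modeled on the question's GetFinalUrl. The getUrlAsync delegate is an assumption that stands in for a timeout-aware GetUrlAsync and is there only to keep the example self-contained; in the question's code it would be a direct call.

```csharp
using System;
using System.Threading.Tasks;

public static class FinalUrlExample
{
    // `getUrlAsync` is expected to yield null when the page did not load in time.
    public static async Task<Uri> GetFinalUrlAsync(
        string url, Func<string, Task<string>> getUrlAsync)
    {
        if (string.IsNullOrEmpty(url))
            return null;

        var result = await getUrlAsync(url);

        // A null or empty result (for example, a timeout) maps to null,
        // which is the behavior the question asks for.
        return string.IsNullOrEmpty(result) ? null : new Uri(result);
    }
}
```

Because the method is now async, callers should await it rather than block on it with Wait() or Result.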
Updated: the latest version of the WebBrowser-based console web scraper can be found on GitHub.
Updated: Adding a pool of WebBrowser objects for multiple parallel downloads.
Below is an implementation of a more or less generic WebBrowser-based web scraper, which works as a console application. It's a consolidation of some of my previous WebBrowser-related efforts, including the code referenced in the question:
Capturing an image of the web page with opacity
Loading a page with dynamic AJAX content
Creating an STA message loop thread for WebBrowser
Loading a set of URLs, one after another
Printing a set of URLs with WebBrowser
Web page UI automation
A few points:
The reusable MessageLoopApartment class is used to start and run a WinForms STA thread with its own message pump. It can be used from a console application, as below. This class exposes a TPL task scheduler (FromCurrentSynchronizationContext) and a set of Task.Factory.StartNew wrappers to use this task scheduler.
This makes async/await a great tool for running WebBrowser navigation tasks on that separate STA thread. This way, a WebBrowser object gets created, navigated and destroyed on that thread. That said, MessageLoopApartment is not tied to WebBrowser specifically.
It's important to enable HTML5 rendering using Browser Feature Control, as otherwise the WebBrowser object runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
It may not always be possible to determine with 100% certainty when a web page has finished rendering. Some pages are quite complex and use continuous AJAX updates. Yet we can get quite close by handling the DocumentCompleted event first, then polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what NavigateAsync does below.
Timeout logic is layered on top of the above, in case the page rendering never ends (note CancellationTokenSource and CreateLinkedTokenSource).
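The last two points, polling for a stable HTML snapshot under a linked timeout, can be sketched roughly as follows. This is an illustration of the idea rather than the answer's NavigateAsync: the names PageSettling and WaitForStableHtmlAsync are hypothetical, and the getHtml and isBusy delegates are assumed stand-ins for reading the document's HTML and WebBrowser.IsBusy.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PageSettling
{
    // Polls until two consecutive HTML snapshots match while the browser is idle.
    // Throws OperationCanceledException if `token` fires or `timeoutMs` elapses.
    public static async Task<string> WaitForStableHtmlAsync(
        Func<string> getHtml, Func<bool> isBusy,
        CancellationToken token, int timeoutMs = 10000, int pollMs = 500)
    {
        // Link the caller's token with a local one that carries the timeout.
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(token))
        {
            cts.CancelAfter(timeoutMs);
            var html = getHtml();
            while (true)
            {
                await Task.Delay(pollMs, cts.Token);
                if (isBusy())
                    continue; // still loading, poll again

                var htmlNow = getHtml();
                if (htmlNow == html)
                    return html; // unchanged since the last poll: treat as rendered
                html = htmlNow;
            }
        }
    }
}
```

In a WebBrowser setting this would run after DocumentCompleted has fired, on the browser's STA thread, so that the continuations after each await return to that thread's message loop.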