I have what is hopefully a simple task, but it's going to take someone who's versed in CefSharp to solve it.
I have a URL that I want to retrieve the HTML from. The problem is that this particular URL doesn't actually deliver the page on a GET. Instead, it pushes a mound of JavaScript to the browser, which then executes and produces the actual rendered page. This means that the usual approaches involving HttpWebRequest and HttpWebResponse aren't going to work.
I've looked at a number of different "headless" options, and the one that I think best meets my needs, for a number of reasons, is CefSharp.OffScreen. But I'm at a loss as to how this thing works. I see that there are several events that can be subscribed to, and some configuration options, but I don't need anything like an embedded browser.
All I really need is a way to do something like this (pseudocode):
string html = CefSharp.Get(url);
I don't have a problem subscribing to events, if that's what's needed to wait for the JavaScript to execute and produce the rendered page.
I know I am doing some archaeology by reviving a two-year-old post, but a detailed answer may be of use to someone else.
So yes, CefSharp.OffScreen is fit for the task.
Below is a class that handles all the browser activity.
Now in my app I just need to do the following:
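A minimal sketch of what such a class and its call site could look like, assuming the CefSharp.OffScreen NuGet package is referenced. The names RenderedPage and GetHtmlAsync are hypothetical, the URL is a placeholder, and exact constructor overloads and extension methods vary between CefSharp versions, so treat this as an illustration rather than the code from this answer.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

// Hypothetical wrapper: the class and method names are illustrative only.
public static class RenderedPage
{
    // Loads the URL in an off-screen browser and returns the page source
    // once the browser reports that it has stopped loading.
    public static async Task<string> GetHtmlAsync(string url)
    {
        using (var browser = new ChromiumWebBrowser(url))
        {
            var loaded = new TaskCompletionSource<bool>(
                TaskCreationOptions.RunContinuationsAsynchronously);

            EventHandler<LoadingStateChangedEventArgs> handler = null;
            handler = (sender, e) =>
            {
                // IsLoading switches to false when the main load finishes.
                if (!e.IsLoading)
                {
                    browser.LoadingStateChanged -= handler;
                    loaded.TrySetResult(true);
                }
            };
            browser.LoadingStateChanged += handler;

            await loaded.Task;

            // Ask the main frame for its HTML as a string.
            return await browser.GetSourceAsync();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // CEF has to be initialized once per process before any browser is created.
        Cef.Initialize(new CefSettings());
        try
        {
            // Placeholder URL (an assumption, not the one used in this answer).
            string html = RenderedPage.GetHtmlAsync("https://example.com/")
                                      .GetAwaiter().GetResult();
            Console.WriteLine(html);
        }
        finally
        {
            // Shut CEF down on the same thread that initialized it.
            Cef.Shutdown();
        }
    }
}
```

One caveat: a page that keeps building its DOM after the load has finished (for example from timers or late network calls) may need an extra wait or a check for a specific element before the source is read.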
And here is the string I get:
"<html><head></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">NotGonnaGiveYouMyIP:)\n</pre></body></html>"
If you can't get a headless version of Chromium to help you, you could try node.js and jsdom (https://github.com/tmpvar/jsdom). It is easy to install and play with once you have node up and running. The GitHub README has simple examples where they pull down a URL, run all of its JavaScript, including any custom JavaScript code (example: jQuery bits to count some type of element), and then you have the HTML in memory to do what you want with. You can just do $('body').html() and get a string, like in your pseudocode. (This even works for things like generated SVG graphics, since those are just more XML tree nodes.)
If you need this as part of a larger C# app that you need to distribute, your idea to use CefSharp.OffScreen sounds reasonable. One approach might be to get things working with CefSharp.WinForms or CefSharp.WPF first, where you can literally see things, then switch to CefSharp.OffScreen once this all works. You can even get some JavaScript running in the on-screen browser to pull down body.innerHTML and return it as a string to the C# side of things before you go headless. If that works, the rest should be easy.
Perhaps start with CefSharp.MinimalExample (https://github.com/cefsharp/CefSharp.MinimalExample) and get that compiling, then tweak it for your needs. You need to be able to set webBrowser.Address in your C# code, and you need to know when the page has loaded. Then you need to call webBrowser.EvaluateScriptAsync(".. JS code ..") with your JavaScript code (as a string), which does what is described above (returning bodyElement.innerHTML as a string).
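The last step could be sketched roughly as follows. The helper name BodyHtml.ReadAsync is hypothetical, and the sketch assumes the page has already finished loading in the browser that is passed in. It is written against the off-screen ChromiumWebBrowser, but the WinForms and WPF controls expose EvaluateScriptAsync through the same extension method.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

// Hypothetical helper: the class and method names are illustrative only.
public static class BodyHtml
{
    // Assumes the page in `browser` has already finished loading.
    public static async Task<string> ReadAsync(ChromiumWebBrowser browser)
    {
        // The script runs in the page's main frame; the value of the
        // last expression is marshalled back to the C# side.
        JavascriptResponse response =
            await browser.EvaluateScriptAsync("document.body.innerHTML");

        if (!response.Success)
        {
            // Message carries the JavaScript error text, if any.
            throw new InvalidOperationException(
                "Script evaluation failed: " + response.Message);
        }

        // A JavaScript string comes back as a .NET string.
        return response.Result as string;
    }
}
```

Evaluating document.documentElement.outerHTML instead would include the head element as well, which may be closer to "the HTML of the page" than the body alone.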