I have an Azure Web App that I want to use to screen scrape a website when I call an action on a controller, like so:
var driver = new PhantomJSDriver();
// Setting Url navigates to the page, so no separate Navigate() call is needed.
driver.Url = "http://url.com";
var source = driver.PageSource;
var pathElement = driver.FindElementByXPath("//table[@class='someclassname']");
string innerHtml = "";
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
if (js != null)
{
    // Read the table's markup through the browser's own DOM.
    innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", pathElement);
}
return innerHtml;
This works fine locally. However, when I deploy to my Azure Web App, I get this error:
Cannot start the driver service on http://localhost:51169/
I assume this has to do with firewalls, since I need to approve PhantomJS in my firewall settings the first time the app runs. My question is: how do I get this to work when deployed in Azure? Is it even possible, or do I need to set this up as a unit test and run it from within Visual Studio?
PhantomJS does not work today in the sandbox that Azure Web Apps run under. See the wiki for a list of things that are known to not work currently, as well as lots of other information about the sandbox.
Here is a snippet that does work on Azure. It is, however, far from usable in production, because I keep getting random connection errors such as:
Unable to connect to the remote server inner message: Unable to connect to the remote server inner message: An attempt was made to access a socket in a way forbidden by its access permissions
The exact same code works well in a console or Windows application.
PhantomJSDriver driver = null;
// Accepts every TLS certificate process-wide. This is insecure; use with care.
ServicePointManager.ServerCertificateValidationCallback =
    new RemoteCertificateValidationCallback(delegate { return true; });
int retry = 0;
while (driver == null && retry < 3)
{
    PhantomJSDriverService service = null;
    try
    {
        service = PhantomJSDriverService.CreateDefaultService();
        service.LocalToRemoteUrlAccess = true;
        service.HideCommandPromptWindow = true;
        service.Start();
        var options = new PhantomJSOptions();
        driver = new PhantomJSDriver(service, options);
    }
    catch (Exception ex)
    {
        // The driver is still null when startup throws, so release the
        // half-started service instead of leaking its process and port.
        if (service != null)
        {
            service.Dispose();
        }
        Thread.Sleep(retry * 1500);
        ServiceAudit.Default.TraceDebug($"Starting web driver failed on {retry} try: {ex.Message}");
    }
    retry++;
}
if (driver == null)
{
    ServiceAudit.Default.TraceError("Web driver could not be started");
}
return driver;
I would rethink using Selenium here. Selenium is meant for automating manual testing of your web app: filling out a form, clicking a button, and so on.
Even if Selenium and the PhantomJS driver do run on your Azure Web App without issues, you'll have a bottleneck of one browser per HTTP request. I suspect you'll run into performance problems very soon.
Furthermore, starting PhantomJS, requesting a page, interacting with it, and shutting PhantomJS down again is slow.
In your case, it sounds like you're not interacting with the source site; you just need its data. So perhaps just parsing the HTML DOM will suffice.
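As a rough sketch of what "just parsing the HTML DOM" could look like: the code below downloads the page with HttpClient and pulls the table out with an XPath query, with no browser involved. It assumes the HtmlAgilityPack NuGet package (not mentioned in the question) and reuses the XPath from the question; the TableScraper class and its method names are made up for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the original question or answers.
public static class TableScraper
{
    // One shared HttpClient for the whole app avoids exhausting sockets.
    private static readonly HttpClient Http = new HttpClient();

    // Returns the inner HTML of the first table with the given class,
    // or an empty string when no such table exists in the markup.
    public static string ExtractTableHtml(string html, string className = "someclassname")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var table = doc.DocumentNode.SelectSingleNode($"//table[@class='{className}']");
        return table?.InnerHtml ?? "";
    }

    // Downloads the page and extracts the table in one step.
    public static async Task<string> ScrapeAsync(string url)
    {
        string html = await Http.GetStringAsync(url);
        return ExtractTableHtml(html);
    }
}
```

From a controller action this would be called as `await TableScraper.ScrapeAsync("http://url.com")`. Note that this only sees the HTML the server sends; if the table is built by JavaScript in the browser, it will not be there to parse.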
It sounds like you should be publishing this to an Azure Web Role instead.
Check out this answer... https://stackoverflow.com/a/43532701/1486716