The situation
I have a simple python script to get the HTML source for a given url:
browser = webdriver.PhantomJS()
browser.get(url)
content = browser.page_source
Occasionally, the url points to a page with slow-loading external resources (e.g. video files, or really slow advertising content).
Webdriver will wait until those resources are loaded before completing the .get(url)
request.
Note: For extraneous reasons, I need to do this with PhantomJS rather than requests
or urllib2
The question
I'd like to set a timeout on PhantomJS resource loading so that if the resource is taking too long to load, the browser just assumes it doesn't exist or whatever.
This would allow me to perform the subsequent .pagesource
query based on what the browser has loaded.
Documentation on webdriver.PhantomJS is very thin, and I haven't found a similar question on SO.
thanks in advance!
PhantomJS has provided resourceTimeout
, which might suit your needs. I quote from documentation here
(in milli-secs) defines the timeout after which any resource requested
will stop trying and proceed with other parts of the page.
onResourceTimeout callback will be called on timeout.
So in Ruby, you can do something like
require 'selenium-webdriver'
capabilities = Selenium::WebDriver::Remote::Capabilities.phantomjs("phantomjs.page.settings.resourceTimeout" => "5000")
driver = Selenium::WebDriver.for :phantomjs, :desired_capabilities => capabilities
I believe in Python, it's something like (untested, only provides the logic, you are the Python developer, hopefully you will figure out)
driver = webdriver.PhantomJS(desired_capabilities={'phantomjs.page.settings.resourceTimeout': '5000'})
Long Explanation below, so TLDR:
Current version of Selenium's Ghostdriver (in PhantomJS 1.9.8) ignores resourceTimeout option, use webdriver's implicitly_wait(), set_page_load_timeout() and wrap them under try-except block.
#Python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
browser = webdriver.PhantomJS()
browser.implicitly_wait(3)
browser.set_page_load_timeout(3)
try:
browser.get("http://url_here")
except TimeoutException as e:
#Handle your exception here
print(e)
finally:
browser.quit()
Explanation
To provide PhantomJS page settings to Selenium, one can use webdriver's DesiredCapabilities such as:
#Python
from selenium import webdriver
cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = False
cap["phantomjs.page.settings.userAgent"] = "faking it"
browser = webdriver.PhantomJS(desired_capabilities=cap)
//Java
DesiredCapabilities capabilities = DesiredCapabilities.phantomjs();
capabilities.setCapability("phantomjs.page.settings.resourceTimeout", 1000);
capabilities.setCapability("phantomjs.page.settings.loadImages", false);
capabilities.setCapability("phantomjs.page.settings.userAgent", "faking it");
WebDriver webdriver = new PhantomJSDriver(capabilities);
But, here's the catch: As in today (2014/Dec/11) with PhantomJS 1.9.8 and its embedded Ghostdriver, resourceTimeout won't be applied by Ghostdriver (See the Ghostdriver issue#380 in Github).
For a workaround, simply use Selenium's timeout functions/methods and wrap webdriver's get method in a try-except/try-catch block, e.g.
#Python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
browser = webdriver.PhantomJS()
browser.implicitly_wait(3)
browser.set_page_load_timeout(3)
try:
browser.get("http://url_here")
except TimeoutException as e:
#Handle your exception here
print(e)
finally:
browser.quit()
//Java
WebDriver webdriver = new PhantomJSDriver();
webdriver.manage().timeouts()
.pageLoadTimeout(3, TimeUnit.SECONDS)
.implicitlyWait(3, TimeUnit.SECONDS);
try {
webdriver.get("http://url_here");
} catch (org.openqa.selenium.TimeoutException e) {
//Handle your exception here
System.out.println(e.getMessage());
} finally {
webdriver.quit();
}