I'm trying to build a web scraper in Python using Selenium webdriver but I'm unable to access the information I need when I retrieve the website source code from webdriver.
I think the issue is that content is added to the page via JavaScript after the page has initially been downloaded from the server. When I run browser.page_source I get the source code of the page before this content was added. I want to know whether it is possible to get the source code of the page after the content loaded with JavaScript has been added (in other words, what I see when I look at the page using Inspect Element).
Here is the basic Python script I'm using:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.opportunities.auckland.ac.nz")
print(browser.page_source)
When I run the above script I get the source code of the page I see when looking at the page source in the browser (i.e. without the additional content visible when the code is viewed with inspect element).
Things I've tried
- Adding time.sleep(10) in various places in case the page had not fully loaded when I was accessing the source.
- Using get_attribute("innerHTML") on the body.
- Using execute_script() to make the JS run.
- Using execute_script() to make the JS scripts run one by one.
It would be great if someone could tell me firstly whether this is possible and, if it is, point me in the right direction. Thanks.
Update 1
I get the following output when trying Piotrek's solution:
Warning (from warnings module):
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py", line 49
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
<html><head></head><body></body></html>
Unfortunately this seems not to work.
I ran into a similar problem once; what helped me was using PhantomJS() instead of Chrome() (even though Selenium support for PhantomJS has been deprecated):

The desired elements are within an <iframe>, so you have to use WebDriverWait to wait for the iframe to be available and switch to it, then use WebDriverWait again to wait for the elements to be visible. You can use the following solution:
Code Block:
Console Output: