For educational purposes, I'm trying to scrape this page gradually with Python and lxml, starting with movie names.
From what I've read so far in the Python docs on lxml and the W3Schools XPath tutorial, this code should give me all movie titles in a list:
from lxml import html
import requests
page = requests.get('http://www.rottentomatoes.com/browse/dvd-top-rentals/')
tree = html.fromstring(page.text)
movies = tree.xpath('//h3[@class="movieTitle"]/text()')
print(movies)
Basically, it should give me the text of every h3 element anywhere in the document whose class attribute
has the value "movieTitle". Upon running the code, though, I only get an empty list printed out.
I can't figure out why.
To investigate, I ran:
movies = tree.xpath('//h3[@class]/text()')
print(movies)
This one should return the text of every h3 that has a class attribute, but it returns this list instead:
['From RT Users Like You!', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
I tried targeting the first string in this list by its class value ("noSpacing center"), and it returned that sole string successfully. So I'm sure there's something I misunderstand about how lxml/XPath works. Can anyone point me in a helpful direction? Thanks in advance!
Information on http://www.rottentomatoes.com/browse/dvd-top-rentals/ is not rendered directly into the page but loaded via XMLHttpRequest calls, so the HTML that requests downloads does not yet contain the movie titles.
The API you are looking for seems to be:
http://d3biamo577v4eu.cloudfront.net/api/private/v1.0/m/list/find?page=1&limit=30&type=dvd-top-rentals&services=amazon%3Bamazon_prime%3Bflixster%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu&sortBy=popularity
The query string is built from the selected filters.
So you must make requests to that endpoint (instead of the URL you are currently requesting) and parse the JSON response to extract the desired data.
Change the "page" GET parameter to fetch the subsequent pages.
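The steps above can be sketched in Python roughly as follows. This is only a minimal sketch: the endpoint and query parameters are the ones quoted above and may no longer respond, and the shape of the JSON is an assumption (a list of movies under a `results` key, each with a `title` field). The helper names `build_params`, `extract_titles` and `fetch_titles` are hypothetical.

```python
# Endpoint quoted in the answer; it may have changed or been retired.
API_URL = "http://d3biamo577v4eu.cloudfront.net/api/private/v1.0/m/list/find"

# Decoded form of the "services" value in the quoted URL (%3B is ";").
SERVICES = "amazon;amazon_prime;flixster;hbo_go;itunes;netflix_iw;vudu"


def build_params(page, limit=30):
    """Build the query parameters for one page of results."""
    return {
        "page": page,
        "limit": limit,
        "type": "dvd-top-rentals",
        "services": SERVICES,
        "sortBy": "popularity",
    }


def extract_titles(payload, list_key="results", title_key="title"):
    """Pull movie titles out of the decoded JSON.

    ASSUMPTION: the key names "results" and "title" are guesses about the
    response format; inspect the real response and adjust them.
    """
    return [item[title_key] for item in payload.get(list_key, []) if title_key in item]


def fetch_titles(page):
    """Request one page from the endpoint and return its movie titles."""
    # Imported here so the pure helpers above work without requests installed.
    import requests

    response = requests.get(API_URL, params=build_params(page), timeout=10)
    response.raise_for_status()
    return extract_titles(response.json())
```

Calling `fetch_titles(1)`, then `fetch_titles(2)`, and so on walks through the pages. Keeping the parsing in `extract_titles` separate from the network call makes it easy to adjust the key names once the actual JSON has been inspected.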
Example with cURL + jq:
Example with Python + Requests:
Output:
Using selenium is another way to wait until the page is fully loaded (i.e. including all JavaScript manipulation). You don't have to use Firefox; you can use other browsers or a headless browser like PhantomJS if displaying the actual site is not required.
Output:
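The selenium approach described above might look roughly like the sketch below. It assumes selenium and a Firefox driver (geckodriver) are installed, and it reuses the `movieTitle` class from the question as a CSS selector (`h3.movieTitle`, which is slightly broader than the exact-match XPath in the question). The function names and the 10-second timeout are hypothetical choices, not part of the original answer.

```python
def clean_titles(texts):
    """Strip whitespace and drop empty strings from an iterable of texts."""
    return [text.strip() for text in texts if text and text.strip()]


def fetch_rendered_titles(url, selector="h3.movieTitle", timeout=10):
    """Load the page in Firefox, wait for the titles to appear, return them.

    ASSUMPTION: the titles end up in elements matching `selector` once the
    page's JavaScript has run; adjust the selector to the real markup.
    """
    # Imported here so clean_titles() works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # any other supported browser works too
    try:
        driver.get(url)
        # Block until at least one matching element is present in the DOM,
        # or raise TimeoutException after `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return clean_titles(element.text for element in elements)
    finally:
        driver.quit()  # always close the browser, even on errors
```

A call such as `fetch_rendered_titles('http://www.rottentomatoes.com/browse/dvd-top-rentals/')` would then return the titles once the JavaScript has populated them, provided the selector still matches the page.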