I have to scrape all the movies from this IMDb page: https://www.imdb.com/list/ls055386972/.
My approach is to first scrape the href value of every link of the form <a href="/title/tt0068646/?ref_=ttls_li_tt">, i.e., to extract the /title/tt0068646/?ref_=ttls_li_tt portion, and then prepend 'https://www.imdb.com' to build the complete URL of the movie, i.e., https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt. But whenever I run response.xpath('//h3[@class]/a[@href]').extract(), it extracts the whole element, including the movie title: [u'<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>', u'<a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler\'s List</a>', ...]
I want only the "/title/tt0068646/?ref_=ttls_li_tt" portion.
How should I proceed?
OUTPUT:
This is working code; please try it:
I would suggest using requests-html to get all the hyperlinks and removing the ones that don't match your criteria. You can even get the absolute URLs using
r.html.absolute_links