import scrapy
from imdbscrape.items import MovieItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc']

    def parse(self, response):
        urls = response.css('h3.lister-item-header > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_movie)
        nextpg = response.css('div.desc > a::attr(href)').extract_first()
        if nextpg:
            nextpg = response.urljoin(nextpg)
            yield scrapy.Request(url=nextpg, callback=self.parse)

    def parse_movie(self, response):
        item = MovieItem()
        item['title'] = self.getTitle(response)
        item['year'] = self.getYear(response)
        item['rating'] = self.getRating(response)
        item['genre'] = self.getGenre(response)
        item['director'] = self.getDirector(response)
        item['summary'] = self.getSummary(response)
        item['actors'] = self.getActors(response)
        yield item
I wrote the above code to scrape all IMDb movies from 2017 to date, but it only scrapes 100 movies. Please help.
I believe the issue is with the selector for the next page link, response.css('div.desc > a::attr(href)').extract_first().
On the first page, https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc
the div with a class of desc contains a single link, with the anchor text Next >>.
Your code grabs the href of that link, which is this
https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=2&ref_=adv_nxt
The spider goes to that page and scrapes the next 50 movies.
However, on that second page the div with a class of desc has TWO links in it, not one like the first page.
The first link is the Previous link, not the Next link, so extract_first() picks up Previous and the spider never moves on to page 3. That is why you end up with only 100 movies.
What I would do is keep a counter of pages scraped, starting at 0.
Check the counter when choosing the link on each page, and increment it once that page has been scraped successfully.
If the counter is 0 (you are on the first page), grab the first link, go to it and scrape the results on that page.
If the counter is greater than 0, grab the second link instead, go to it and scrape the results on that page.
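
The counter idea above could be sketched roughly like this. It is only a minimal illustration, not tested against the live site: pick_next_href and pages_scraped are hypothetical names, the hrefs in the example are made-up placeholders, and it assumes the links in div.desc always appear in the order Previous, Next.

```python
from typing import List, Optional


def pick_next_href(hrefs: List[str], pages_scraped: int) -> Optional[str]:
    """Choose the next-page href from the links found in div.desc.

    hrefs: the hrefs in document order (hypothetical input; in the spider
        this would be the list of all link hrefs inside div.desc).
    pages_scraped: how many listing pages were scraped before this one.
    """
    if pages_scraped == 0:
        # First page: the only link in the div is Next.
        return hrefs[0] if hrefs else None
    # Later pages: the first link is Previous, the second is Next.
    # If there is no second link, there is nowhere further to go.
    return hrefs[1] if len(hrefs) > 1 else None


# Placeholder hrefs for illustration only (not real IMDb URLs).
first_page_links = ['/next-is-page-2']
second_page_links = ['/back-to-page-1', '/next-is-page-3']

print(pick_next_href(first_page_links, pages_scraped=0))
print(pick_next_href(second_page_links, pages_scraped=1))
```

In parse, hrefs would come from response.css('div.desc > a::attr(href)').extract(), and the counter could live on the spider instance (for example a self.pages_scraped attribute set to 0 at class level) and be incremented after each listing page. When the function returns None, simply do not yield another request.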