I am trying to scrape multiple webpages using scrapy. The links of the pages look like:
http://www.example.com/id=some-number
On each subsequent page, the number at the end is reduced by 1.
So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:
import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range(starting_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # **parsing data from the webpage**
        pass
This is running into an infinite loop, and when I print the page number I get negative numbers. I think that is happening because I am requesting a page from within my parse() function.
But the example given here works fine. Where am I going wrong?
The first page requested is "http://www.example.com/id=1000" (starting_number). Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.
self.page_number is a spider attribute, so when you decrement its value, you have self.page_number == 500 after the first parse(). So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.
You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.
For each response, you're generating 500 requests.
You can stop the loop by testing self.page_number >= 0.
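For illustration, here is a rough sketch of where such a test could go, assuming your original parse() decremented a self.page_number attribute inside a loop as described above (the attribute and the loop are reconstructed from the description, not your actual code):

def parse(self, response):
    # ... extract data from the current page here ...
    for i in range(0, 500):
        self.page_number -= 1
        # only yield a new request while the page id is still non-negative
        if self.page_number >= 0:
            yield Request(url=URL % self.page_number, callback=self.parse)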
Edit after OP's question in the comments:
No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on. See the start_requests documentation.
Something like this would work:
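A minimal sketch of that approach, reusing the URL, starting_number and number_of_pages constants from your question (this is a reconstruction, so adapt the parse() body to whatever you actually extract):

import scrapy
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']

    def start_requests(self):
        # enqueue every page up front; Scrapy schedules and fetches them asynchronously
        # range(1000, 500, -1) yields the ids 1000, 999, ..., 501
        for i in range(starting_number, starting_number - number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # extract data from the page here
        pass

Because start_requests() is a generator, Scrapy consumes it lazily and only keeps as many requests in flight as its concurrency settings allow, so enqueuing all 500 pages this way is not a problem.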