Scraping many pages using scrapy

Posted 2019-02-15 19:45

I am trying to scrape multiple webpages using scrapy. The links of the pages look like:

http://www.example.com/id=some-number

On each subsequent page the number at the end decreases by 1.

So I am trying to build a spider that navigates to the other pages and scrapes them too. The code I have is given below:

import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range (starting_number, number_of_pages, -1):
            yield Request(url = URL % i, callback = self.parse)

    def parse(self, response):
        **parsing data from the webpage**

This runs into an infinite loop, and when I print the page number I get negative numbers. I think that is happening because I am requesting pages from within my parse() function.

But then the example given here works okay. Where am I going wrong?

1 Answer
一夜七次
2019-02-15 20:31

The first page requested is "http://www.example.com/id=1000" (starting_number).

Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.

self.page_number is a spider attribute, so when you decrement its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.

For each response, you're generating 500 requests.

You can stop the loop by testing self.page_number >= 0.
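As a minimal sketch of that guard (assuming you keep the decrement-in-parse approach, with self.page_number initialized to starting_number and URL being the same format string as above), parse() could follow one page at a time and stop once the ID would go negative:

def parse(self, response):
    # ... extract data from the current page ...
    self.page_number -= 1
    # stop following pages once the ID would become negative
    if self.page_number >= 0:
        yield Request(url=URL % self.page_number, callback=self.parse)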


Edit after OP question in comments:

No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.

See start_requests documentation.

Something like this would work:

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parse data from the webpage here
        pass
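For completeness, here is a rough sketch of what that parse() placeholder could look like; the CSS selector and field names are made up for illustration and need to be adapted to the real page:

def parse(self, response):
    # hypothetical item -- replace the selector and fields with what the page actually contains
    yield {
        'id': response.url.rsplit('=', 1)[-1],
        'title': response.css('h1::text').extract_first(),
    }

Run the spider as usual with scrapy crawl final -o pages.json: Scrapy will schedule all 500 requests from start_requests() and write one item per page.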