Scraping many pages using scrapy

Posted 2019-02-15 19:45

I am trying to scrape multiple webpages using scrapy. The links of the pages look like:

http://www.example.com/id=some-number

On each subsequent page the number at the end decreases by 1.

So I am trying to build a spider that navigates to the other pages and scrapes them too. The code I have is given below:

import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range (starting_number, number_of_pages, -1):
            yield Request(url = URL % i, callback = self.parse)

    def parse(self, response):
        **parsing data from the webpage**

This runs into an infinite loop, and when I print the page number I get negative numbers. I think that is happening because I am requesting pages from within my parse() function.

But then the example given here works okay. Where am I going wrong?

1 Answer
一夜七次
2019-02-15 20:31

The first page requested is "http://www.example.com/id=1000" (starting_number).

Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.

self.page_number is a spider attribute, so when you decrement its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.

For each response, you're generating 500 requests.

You can stop the loop by testing self.page_number >= 0.
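As a minimal sketch of that guard (assuming you keep the decrement-in-parse approach, with self.page_number initialized to starting_number and URL being the same format string as above), parse() could follow one page at a time and stop once the ID would go negative:

def parse(self, response):
    # ... extract data from the current page ...
    self.page_number -= 1
    # stop following pages once the ID would become negative
    if self.page_number >= 0:
        yield Request(url=URL % self.page_number, callback=self.parse)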


Edit after OP question in comments:

No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.

See start_requests documentation.

Something like this would work:

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parse data from the webpage here
        pass
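For completeness, here is a rough sketch of what that parse() placeholder could look like; the CSS selector and field names are made up for illustration and need to be adapted to the real page:

def parse(self, response):
    # hypothetical item -- replace the selector and fields with what the page actually contains
    yield {
        'id': response.url.rsplit('=', 1)[-1],
        'title': response.css('h1::text').extract_first(),
    }

Run the spider as usual with scrapy crawl final -o pages.json: Scrapy will schedule all 500 requests from start_requests() and write one item per page.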