How to recursively crawl subpages with Scrapy

Posted 2019-05-24 20:54

So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow a sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like the outline below (a concrete sketch follows it):

  1. Category 1 name
    • Subcategory 1 name
      • data from this subcategory's page
    • Subcategory n name
      • data from this subcategory's page
  2. Category n name
    • Subcategory 1 name
      • data from this subcategory's page

etc.
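For concreteness, here is one way that outline could translate into actual JSON (the field names are just illustrative, nothing in Scrapy prescribes them):

    [
      {
        "category": "Category 1 name",
        "subcategories": [
          {
            "subcategory": "Subcategory 1 name",
            "data": "text scraped from this subcategory's page"
          },
          {
            "subcategory": "Subcategory n name",
            "data": "text scraped from this subcategory's page"
          }
        ]
      }
    ]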

Eventually I want to be able to use this data with Elasticsearch.

I barely have any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea what to do from here). From my research I believe I need to use a CrawlSpider, but I am unsure of what that entails. I have also been advised to use BeautifulSoup. Any help would be greatly appreciated.

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories',]

    def parse(self, response):
        # One item per category section, taking the first link's text
        for i in response.css('div.CategoryTreeSection'):
            yield {
                'categories': i.css('a::text').extract_first()
            }

1 Answer

霸刀☆藐视天下 · Posted 2019-05-24 21:30

I'm not familiar with Elasticsearch, but I'd build the scraper like this:

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories',]

    def parse(self, response):
        for i in response.css('div.CategoryTreeSection'):
            # Put your selector for the subcategory link here;
            # urljoin() turns a relative href into an absolute URL
            subcategory = response.urljoin(i.css('Put your selector here').extract_first())
            req = scrapy.Request(subcategory, callback=self.parse_subcategory)
            # Pass the category name along to the next callback via meta
            req.meta['category'] = i.css('a::text').extract_first()
            yield req

    def parse_subcategory(self, response):
        yield {
            'category': response.meta.get('category'),
            # Select the name of the subcategory
            'subcategory': response.css('Put your selector here').extract_first(),
            # Select the data of the subcategory
            'subcategorydata': response.css('Put your selector here').extract_first(),
        }

You collect the subcategory URL and send a request; the response to that request will be handled by parse_subcategory. While sending the request, we attach the category name to its meta dict.
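As a side note, if you end up on Scrapy 1.7 or newer, cb_kwargs is the recommended way to pass your own values to a callback (meta still works, but Scrapy also uses it for its own internals). A minimal variant of the request above, with the same placeholder selectors:

    # Inside parse(): pass the category name as a keyword argument
    req = scrapy.Request(
        subcategory,
        callback=self.parse_subcategory,
        cb_kwargs={'category': i.css('a::text').extract_first()},
    )
    yield req

    # The callback then receives it as an ordinary parameter
    def parse_subcategory(self, response, category):
        ...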

In the parse_subcategory function you read the category name back out of meta and collect the subcategory data from the page.
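Since the question actually wants one more hop (subcategory page → data page), the same pattern nests: turn parse_subcategory into another link-follower that forwards both names through meta, and move the final yield into a third callback. A rough sketch, again with placeholder selectors:

    def parse_subcategory(self, response):
        # Follow the link from the subcategory page to its data page
        data_url = response.urljoin(response.css('Put your selector here').extract_first())
        req = scrapy.Request(data_url, callback=self.parse_data)
        req.meta['category'] = response.meta.get('category')
        req.meta['subcategory'] = response.css('Put your selector here').extract_first()
        yield req

    def parse_data(self, response):
        yield {
            'category': response.meta.get('category'),
            'subcategory': response.meta.get('subcategory'),
            'data': response.css('Put your selector here').extract_first(),
        }

To get the JSON file, Scrapy's feed export handles it from the command line: scrapy crawl helpme -o output.json. Grouping the flat items into the nested category/subcategory outline from the question would then be a small post-processing step (or an item pipeline).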
