Scrapy crawl all sitemap links

Posted 2019-01-27 11:30

Question:

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # SitemapSpider calls this for every URL found in the sitemap
        print(response.url)

Answer 1:

You need to add sitemap_rules to process the data at the crawled URLs, and you can create as many rules as you want. For instance, say you have a page at http://www.xyz.nl/x/ for which you want to create a rule:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback) tuples - this example contains one rule;
    # the callback is given as a string naming a spider method, since
    # parse_x is not yet defined at this point in the class body
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        # extract every <p> element on the page
        paragraphs = response.xpath('//p').extract()
        yield {'paragraphs': paragraphs}

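For completeness: Scrapy tries sitemap_rules in order and uses only the first regex that matches a URL, so different URL patterns can be routed to different callbacks, with an empty pattern as a catch-all at the end. A minimal sketch (the /y/ path and the callback names here are hypothetical):

from scrapy.spiders import SitemapSpider

class MultiRuleSpider(SitemapSpider):
    name = 'xyz_multi'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # rules are tried in order; the first matching regex wins
    sitemap_rules = [
        ('/x/', 'parse_x'),    # URLs containing /x/
        ('/y/', 'parse_y'),    # hypothetical second section
        ('', 'parse_other'),   # empty pattern matches everything else
    ]

    def parse_x(self, response):
        yield {'section': 'x', 'url': response.url}

    def parse_y(self, response):
        yield {'section': 'y', 'url': response.url}

    def parse_other(self, response):
        yield {'section': 'other', 'url': response.url}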

Answer 2:

Essentially, you could create new Request objects for the URLs the SitemapSpider extracts and parse the responses with a new callback:

from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
        # dont_filter=True is needed because this URL was just crawled
        # and the duplicate filter would otherwise drop the request
        return Request(response.url, callback=self.parse_sitemap_url,
                       dont_filter=True)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
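
Worth noting: SitemapSpider already downloads every URL listed in the sitemap and passes each response to parse by default (its default sitemap_rules are [('', 'parse')]), so the second request round-trip can be skipped and the extraction done in parse itself. A minimal sketch, with the <title> XPath as an illustrative example:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # each sitemap URL's response arrives here directly;
        # extract whatever you need without issuing a second request
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }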