I want to crawl all he links present in the sitemap.xml of a fixed site. I've came across Scrapy's SitemapSpider. So far i've extracted all the urls in the sitemap. Now i want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:
class MySpider(SitemapSpider):
name = "xyz"
allowed_domains = ["xyz.nl"]
sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
def parse(self, response):
print response.url
You need to add sitemap_rules to process the data in the crawled urls, and you can create as many as you want.
For instance say you have a page named http://www.xyz.nl//x/ you want to create a rule:
class MySpider(SitemapSpider):
name = 'xyz'
sitemap_urls = 'http://www.xyz.nl/sitemap.xml'
# list with tuples - this example contains one page
sitemap_rules = [('/x/', parse_x)]
def parse_x(self, response):
sel = Selector(response)
paragraph = sel.xpath('//p').extract()
return paragraph
Essentially you could create new request objects to crawl the urls created by the SitemapSpider and parse the responses with a new callback:
class MySpider(SitemapSpider):
name = "xyz"
allowed_domains = ["xyz.nl"]
sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
def parse(self, response):
print response.url
return Request(response.url, callback=self.parse_sitemap_url)
def parse_sitemap_url(self, response):
# do stuff with your sitemap links