I am using the sitemap spider (SitemapSpider) in Scrapy with Python. The sitemap seems to have an unusual format, with '//' in front of the URLs:
<url>
<loc>//www.example.com/10/20-baby-names</loc>
</url>
<url>
<loc>//www.example.com/elizabeth/christmas</loc>
</url>
myspider.py
from scrapy.contrib.spiders import SitemapSpider
from myspider.items import *

class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ["http://www.example.com/robots.txt"]

    def parse(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item
I am getting this error:
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names
How can I manually parse the URLs when using the sitemap spider?
If I see it correctly, you could (for a quick solution) override the default implementation of _parse_sitemap in SitemapSpider. It's not nice, because you will have to copy a lot of code, but it should work. You'll have to add a method that generates a URL with a scheme. This is just a general idea and untested, so it might not work at all or might contain syntax errors. Please respond via comments so I can improve my answer.
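The "method to generate a URL with scheme" could look roughly like the sketch below. It uses only Python 3's urllib.parse (the module is called urlparse in Python 2); the name ensure_scheme and the default scheme "http" are assumptions made for illustration, not part of Scrapy.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_scheme(url, default_scheme="http"):
    """Return url with default_scheme added if it is of the form //host/path.

    ensure_scheme is a hypothetical helper, not a Scrapy API.
    """
    url = url.strip()
    parts = urlsplit(url)
    if parts.scheme:
        # Already absolute (http://..., https://...): leave untouched.
        return url
    if not parts.netloc:
        # No host either (e.g. "/path"): nothing sensible to add here.
        return url
    # Scheme-less but with a host: rebuild the URL with the default scheme.
    return urlunsplit(parts._replace(scheme=default_scheme))
```

In a copied _parse_sitemap you would apply this to every loc value before building a Request from it. If your Scrapy version provides the sitemap_filter(entries) hook on SitemapSpider, rewriting entry['loc'] there may be a lighter alternative to copying the whole method, though I have not tested that against this sitemap.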
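The sitemap you are trying to parse seems to be invalid. According to the URI RFC (RFC 3986), a reference without a scheme is perfectly fine, because it is resolved against the URL of the document that contains it. The sitemap protocol, however, requires URLs to begin with a scheme.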
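To illustrate why the RFC considers such a reference fine: a URL starting with '//' takes its scheme from the base URL it is resolved against. A small sketch with the standard library, where the base URLs are assumed locations of the sitemap and the helper name resolve_loc is made up for this example:

```python
from urllib.parse import urljoin


def resolve_loc(base_url, loc):
    """Resolve a sitemap <loc> value against the URL the sitemap came from.

    resolve_loc is a hypothetical wrapper around urllib.parse.urljoin.
    """
    return urljoin(base_url, loc.strip())


# The scheme-less reference inherits the scheme of the base URL.
over_http = resolve_loc("http://www.example.com/sitemap.xml",
                        "//www.example.com/10/20-baby-names")
over_https = resolve_loc("https://www.example.com/sitemap.xml",
                         "//www.example.com/elizabeth/christmas")
```

A browser does this resolution automatically; Scrapy's Request does not, which is why it complains about the missing scheme.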
I used the trick by @alecxe to parse the URLs within the spider. I made it work, but I am not sure if it is the best way to do it.
I think the nicest and cleanest solution would be to add a downloader middleware that fixes the malformed URLs without the spider noticing.
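A rough, untested sketch of such a middleware follows. Since the error is raised when the Request is created, the idea here is to rewrite the sitemap body in process_response before the spider parses it. The class name SitemapSchemeFixMiddleware is made up, and the code assumes the classic process_response(request, response, spider) signature and that response.replace(body=...) returns a modified copy.

```python
import re
from urllib.parse import urlsplit

# Matches the start of a <loc> element whose URL begins with "//".
_SCHEMELESS_LOC = re.compile(rb"(<loc>\s*)//")


class SitemapSchemeFixMiddleware:
    """Hypothetical downloader middleware that adds a scheme to //host/path
    entries inside <loc> elements of a downloaded sitemap."""

    def process_response(self, request, response, spider):
        body = response.body
        if not _SCHEMELESS_LOC.search(body):
            # Nothing to fix: hand the response on unchanged.
            return response
        # Reuse the scheme the sitemap itself was fetched with (assumption:
        # the pages are reachable over the same scheme).
        scheme = urlsplit(request.url).scheme or "http"
        prefix = scheme.encode("ascii") + b"://"
        fixed = _SCHEMELESS_LOC.sub(lambda m: m.group(1) + prefix, body)
        return response.replace(body=fixed)
```

You would enable it through the DOWNLOADER_MIDDLEWARES setting of your project. Note that it looks at every response containing a <loc> element, and that compressed sitemap files (.xml.gz) would probably pass through unchanged, so treat it as a starting point only.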