Below is my spider code:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    import urlparse

    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            print response, '------->'
Here I am trying to combine the href link with the base URL, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why I am getting this error, and how to join the base URL with the href link and yield a request?
It is because you didn't add the scheme, e.g. http://, in your base URL.
Try:
urlparse.urljoin('http://www.domain.com/', i[1:])
Or, even easier:

    urlparse.urljoin(response.url, i)
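For context, this is roughly how that fix slots into the parse method from the question (a sketch, assuming the hrefs are root-relative like the /bookstore/detail/... path shown in the error message):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            # response.url already carries the http:// scheme, and a
            # root-relative href ("/bookstore/detail/...") joins against the
            # domain root, so the resulting request URL is absolute
            yield Request(urlparse.urljoin(response.url, i), callback=self.parse_url)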
as urlparse.urljoin will sort out the base URL itself. (Note that the href keeps its leading slash here, so it resolves against the domain root rather than against /bookstore/new.)

An alternative solution, if you don't want to use urlparse:
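A minimal sketch of that alternative, assuming the get_base_url and urljoin_rfc helpers that ship with the BaseSpider-era Scrapy used in the question:

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            # get_base_url() extracts the base URL (scheme and domain, plus any
            # <base> tag) from the response itself, so nothing is hard-coded
            yield Request(urljoin_rfc(get_base_url(response), i), callback=self.parse_url)

(Newer Scrapy releases expose the same idea as response.urljoin(i).)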
This solution goes even a step further: here Scrapy works out the domain base for joining. And, as you can see, you don't have to provide the obvious http://www.example.com for joining. This makes your code reusable in the future if you want to change the domain you are crawling.