I've been stuck on this bug for a while. The error message is as follows:
File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h
Scrapy code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from spyder.items import SypderItem
import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
# _*_ coding: utf-8 _*_
class some_Spyder(CrawlSpider):
    name = "spyder"

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        self.allowed_domains = "domainname.com"
        self.start_urls = "http://www.domainname.com/"
        self.xpaths = '''//td[@class="CatBg" and @width="25%"
                          and @valign="top" and @align="center"]
                          /table[@cellspacing="0"]//tr/td/a/@href'''
        self.rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
            Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )
        super(spyder, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')
        item = IgeItem()
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')
        items.append(item)
        return items
I'm pretty sure it has something to do with the URLs I'm asking Scrapy to follow in the LinkExtractor. When I extract them in the shell, they look something like this:
data=u'cart.php?target=category&category_id=826'
Compared to another url extracted from a working spider:
data=u'/path/someotherpath/category.php?query=someval'
I've had a look at a few questions on SO, such as Downloading pictures with scrapy but from reading it I think I may have a slightly different problem.
I also took a look at this - http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html
which explains that the error is raised if self._url is missing a ":". Looking at the start_urls I've defined, I can't quite see why this error would show, since the scheme is clearly defined.
Thanks for reading,
Toby
Prepend the URL with 'http://' or 'https://'.
Scheme basically has a syntax like
It's more clear in the description on that same definition page:
In the case of "Missing scheme", it appears that the [//[user:password@]host[:port]] part is missing, as mentioned above.
I had a similar problem, and this simple concept was enough to solve it for me!
Hope this helps some.
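To make the idea of a "scheme" concrete, here is a minimal sketch using only the standard library, not Scrapy. It takes the relative link shown in the question, checks that it has no scheme, and joins it onto a base URL. The base URL is an assumption for illustration (the page on which the link was found); the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse, urljoin  # Python 3
except ImportError:
    from urlparse import urlparse, urljoin  # Python 2, as in the question

# Relative link as extracted in the question's shell session.
relative = "cart.php?target=category&category_id=826"

# Assumed base: the URL of the page the link was found on.
base = "http://www.domainname.com/"

# A relative link has an empty scheme, so it cannot be requested as-is.
print(repr(urlparse(relative).scheme))

# Joining it onto the base URL yields an absolute URL with a scheme.
absolute = urljoin(base, relative)
print(absolute)
print(urlparse(absolute).scheme)
```

This matters mainly when you build requests by hand from extracted href values; a URL passed to a request needs to be absolute.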
Change start_urls to a list (for example, ["http://www.domainname.com/"]) and it should work.
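A small sketch of why the string form fails, again using only the standard library. Scrapy creates one request per element of start_urls, so iterating over a plain string hands it single characters. The helper name missing_scheme is hypothetical and only used for this illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as in the question


def missing_scheme(urls):
    """Return every element of `urls` that has no URL scheme.

    `urls` is iterated element by element, the same way a list of
    start URLs would be.
    """
    return [u for u in urls if not urlparse(u).scheme]


as_string = "http://www.domainname.com/"    # what the question's spider sets
as_list = ["http://www.domainname.com/"]    # what is expected

# Iterating a string yields characters, and the first one is "h".
print(missing_scheme(as_string)[0])

# Iterating the list yields the full URL, which has a scheme.
print(missing_scheme(as_list))
```

The first offending element of the string version is "h", which matches the trailing "h" in the error message.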
As @Guy answered earlier, the start_urls attribute must be a list. The exceptions.ValueError: Missing scheme in request url: h message comes from that: the "h" in the error message is the first character of "http://www.bankofwow.com/", interpreted as a list (of characters).
allowed_domains must also be a list of domains, otherwise you'll get filtered "offsite" requests.
Change restrict_xpaths so that it selects a region rather than ending in /a/@href: it should represent an area in the document where links are to be found, not the link URLs themselves (see http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor).
Finally, it's customary to define these as class attributes instead of setting them in __init__.