I wrote a spider that worked brilliantly the first time. The second time I tried to run it, it didn't venture beyond the start_urls. I tried to fetch the URL in scrapy shell and create an HtmlXPathSelector object from the returned response. That is when I got the error.
So the steps were:
[scrapy shell] fetch('http://example.com')  # the real URL is something other than example.com
[scrapy shell] from scrapy.selector import HtmlXPathSelector
[scrapy shell] hxs = HtmlXPathSelector(response)
---------------------------------------------------------------------------
Traceback:
AttributeError Traceback (most recent call last)
<ipython-input-3-a486208adf1e> in <module>()
----> 1 HtmlXPathSelector(response)
/home/codefreak/project-r42catalog/env-r42catalog/lib/python2.7/site-packages/scrapy/selector/lxmlsel.pyc in __init__(self, response, text, namespaces, _root, _expr)
29 body=unicode_to_str(text, 'utf-8'), encoding='utf-8')
30 if response is not None:
---> 31 _root = LxmlDocument(response, self._parser)
32
33 self.namespaces = namespaces
/home/codefreak/project-r42catalog/env-r42catalog/lib/python2.7/site-packages/scrapy/selector/lxmldocument.pyc in __new__(cls, response, parser)
25 if parser not in cache:
26 obj = object_ref.__new__(cls)
---> 27 cache[parser] = _factory(response, parser)
28 return cache[parser]
29
/home/codefreak/project-r42catalog/env-r42catalog/lib/python2.7/site-packages/scrapy/selector/lxmldocument.pyc in _factory(response, parser_cls)
11 def _factory(response, parser_cls):
12 url = response.url
---> 13 body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
14 parser = parser_cls(recover=True, encoding='utf8')
15 return etree.fromstring(body, parser=parser, base_url=url)
Error:
AttributeError: 'Response' object has no attribute 'body_as_unicode'
Am I overlooking something very obvious, or have I stumbled upon a bug in scrapy?
body_as_unicode is a method of TextResponse. Scrapy creates a TextResponse, or one of its subclasses such as HtmlResponse, when the HTTP response contains textual content. In your case, the most likely explanation is that scrapy believes the response does not contain text, so it created a plain Response, which has no body_as_unicode method. That matches the 'Response' object named in the AttributeError.
Does the HTTP response from the server correctly set the Content-Type header? Does the page render correctly in a browser? The answers will help determine whether this is expected behavior or a bug.
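To check both points from inside scrapy shell, something like the following might help. It is only a rough sketch: describe_response and FakeBinaryResponse are hypothetical names made up for this illustration, not part of scrapy. The helper relies only on the response having a headers mapping, and it reports the response class, the Content-Type header, and whether the body_as_unicode attribute from the traceback is present.

```python
def describe_response(response):
    """Summarize the facts relevant to the AttributeError in the traceback.

    Hypothetical helper, not a scrapy API. It only assumes that the object
    passed in may have a dict-like `headers` attribute.
    """
    headers = getattr(response, "headers", None) or {}
    return {
        # A plain 'Response' here (rather than 'HtmlResponse') suggests the
        # body was not treated as text.
        "class_name": type(response).__name__,
        # What the server declared; None means the header was missing.
        "content_type": headers.get("Content-Type"),
        # The attribute the selector tried to call.
        "has_body_as_unicode": hasattr(response, "body_as_unicode"),
    }


class FakeBinaryResponse(object):
    """Hypothetical stand-in for a non-text response, used only for the demo."""
    headers = {"Content-Type": "application/octet-stream"}


if __name__ == "__main__":
    print(describe_response(FakeBinaryResponse()))
```

In scrapy shell you would call describe_response(response) right after fetch(...). If class_name comes back as Response and content_type is missing or not a text type, the server's headers are the first thing to look at.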