I am building this spider. The crawling itself seems fine, because I can open the Scrapy shell, go through the HTML page, and test my XPath queries there.
I am not sure what I am doing wrong. Any help would be appreciated. I have reinstalled Twisted, but that changed nothing.
My spider looks like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name="spider1"
    #allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search_results"]/div[1]/div')
    for site in sites:
        item = spiderItem()
        item['title'] = site.select('div[2]/h2/a/text()').extract()
        item['author'] = site.select('div[2]/span/a/text()').extract()
        item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
        items.append(item)
    return items
When I run the spider with scrapy crawl Spider1, I get the following error:
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-09-25 17:56:15-0400 [Spider1] DEBUG: Crawled (200) <GET http://www.example.com> (referer: None)
2012-09-25 17:56:15-0400 [Spider1] ERROR: Spider error processing <GET http://www.example.com>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
    self._startRunCallbacks(result)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
    raise NotImplementedError
exceptions.NotImplementedError:
2012-09-25 17:56:15-0400 [Spider1] INFO: Closing spider (finished)
2012-09-25 17:56:15-0400 [Spider1] INFO: Dumping spider stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186965,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 9, 25, 21, 56, 15, 326000),
'scheduler/memory_enqueued': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2012, 9, 25, 21, 56, 12, 157000)}
2012-09-25 17:56:15-0400 [Spider1] INFO: Spider closed (finished)
2012-09-25 17:56:15-0400 [scrapy] INFO: Dumping global stats:
{}
Leo is right: the indentation is not correct. You probably have tabs and spaces mixed in your script, because you pasted some of the code and typed the rest yourself, and your editor allowed both in the same file. Convert all tabs to spaces so that parse() ends up indented as a method of your spider class.
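If you want to check a file for this before re-running the spider, here is a minimal sketch using plain Python and no Scrapy. The function names find_tab_indented_lines and tabs_to_spaces are made up for illustration, and a tab width of 4 is only an assumption, so pick whatever your editor uses.

```python
def find_tab_indented_lines(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            flagged.append(number)
    return flagged


def tabs_to_spaces(source, width=4):
    """Expand every tab to spaces, using tab stops every `width` columns.

    Note: str.expandtabs() also expands tabs inside string literals,
    so review the result if your code contains literal tab characters.
    """
    return source.expandtabs(width)
```

You would read your spider file into a string, pass it to find_tab_indented_lines() to see which lines are affected, and write the result of tabs_to_spaces() back. The standard library's tabnanny module (python -m tabnanny yourfile.py) reports ambiguous indentation too.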
For anyone else who runs into this problem: please make sure you didn't rename the parse() method like I did. Otherwise it throws the same NotImplementedError.
I spent about three hours trying to figure that out -.-
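To show why a renamed callback gives the same traceback, here is a small sketch that does not import Scrapy. BaseSpiderStandIn is a hypothetical stand-in that only copies the one behaviour visible in the traceback above (the base parse() raises NotImplementedError). The name parse_page is also just an assumed example of a rename.

```python
class BaseSpiderStandIn(object):
    """Hypothetical stand-in for the base spider class (not Scrapy's real code)."""

    def parse(self, response):
        # Default callback: fails unless a subclass overrides it.
        raise NotImplementedError


class RenamedSpider(BaseSpiderStandIn):
    # The callback has a different name, so the inherited parse() is still
    # the one that gets called for the start URLs.
    def parse_page(self, response):
        return ["item"]


class CorrectSpider(BaseSpiderStandIn):
    # Same body, but named parse(), so it overrides the default.
    def parse(self, response):
        return ["item"]


def run_default_callback(spider, response="<html></html>"):
    """Call parse() as the default callback and report what happens."""
    try:
        return spider.parse(response)
    except NotImplementedError:
        return "NotImplementedError"
```

Calling run_default_callback(RenamedSpider()) hits the base-class parse() and reports the error, while run_default_callback(CorrectSpider()) returns the items.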
Your parse method is outside the class body. Indent it so that it is defined inside your spider class.
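A rough illustration of the difference, again without importing Scrapy: BaseSpiderStandIn is a hypothetical stand-in for the base class, and overrides_parse is a helper invented for this sketch. The class attributes are copied from the question.

```python
class BaseSpiderStandIn(object):
    """Hypothetical stand-in: its parse() raises, as in the traceback."""

    def parse(self, response):
        raise NotImplementedError


class BrokenSpider(BaseSpiderStandIn):
    name = "spider1"
    start_urls = ["http://www.example.com"]


# Starts at column 0, so this is a module-level function, not a method
# of BrokenSpider. The class never sees it.
def parse(self, response):
    return []


class FixedSpider(BaseSpiderStandIn):
    name = "spider1"
    start_urls = ["http://www.example.com"]

    # Indented one level under the class, so it overrides the default.
    def parse(self, response):
        return []


def overrides_parse(spider_cls):
    """True if the class itself defines parse(), rather than inheriting it."""
    return "parse" in vars(spider_cls)
```

overrides_parse(BrokenSpider) is False and overrides_parse(FixedSpider) is True. In the broken layout the spider still inherits the base parse(), which is the one that raises.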