So, I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have been plugging away at a very basic crawler. However, I cannot get the output into a JSON file. I feel like I am missing something obvious, but I haven't been able to turn anything up after looking at a number of other examples and trying several different things.
To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and their associated prices. The prices will change fairly often, and the items will change much less frequently.
Here is my items.py:

from scrapy.item import Item, Field

class CartItems(Item):
    url = Field()
    name = Field()
    price = Field()
And here is the spider:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from Example.items import CartItems

class DomainSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/path/to/desired/page']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cart = CartItems()
        cart['url'] = hxs.select('//title/text()').extract()
        cart['name'] = hxs.select('//td/text()').extract()[1]
        cart['price'] = hxs.select('//td/text()').extract()[2]
        return cart
If, for example, I run hxs.select('//td/text()').extract()[1] from the Scrapy shell on the URL http://www.example.com/path/to/desired/page, I get the following response:
u'Text field I am trying to download'
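To rule out the XPath expression itself, I also reproduced it outside Scrapy with plain lxml (which Scrapy's selectors are built on). The markup below is made up by me to mimic the real page, so the cell contents are just placeholders:

```python
from lxml import html

# Hypothetical markup standing in for the real page at example.com
page = html.fromstring(
    "<html><head><title>Cart</title></head><body><table><tr>"
    "<td>Item name</td>"
    "<td>Text field I am trying to download</td>"
    "<td>19.99</td>"
    "</tr></table></body></html>"
)

# Same XPath as in the spider: grab the text of every <td>
cells = page.xpath('//td/text()')
print(cells[1])  # Text field I am trying to download
```

So the selector returns a list of strings and indexing into it behaves as expected, which makes me suspect the problem is elsewhere.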
EDIT:
Okay, so I wrote a pipeline following one I found in the Scrapy wiki (I somehow missed that section when I was digging through this the last few days), altered to use JSON instead of XML.
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonItemExporter

class JsonExportPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
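For completeness, the pipeline is registered in my settings.py as below. The dotted path assumes the class lives in Example/pipelines.py, so adjust it to match your own project layout:

```python
# settings.py -- without this entry Scrapy never calls the pipeline at all;
# the dotted path 'Example.pipelines.JsonExportPipeline' is my project layout
ITEM_PIPELINES = ['Example.pipelines.JsonExportPipeline']

# Note: newer Scrapy versions expect a dict with an order value instead:
# ITEM_PIPELINES = {'Example.pipelines.JsonExportPipeline': 300}
```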
This does output a file "example.com_items.json", but all it contains is "[]". So, something is still not right here. Is the issue with the spider, or is the pipeline not done correctly? Clearly I am missing something, so if someone could nudge me in the right direction, or link me to any examples that might help, it would be most appreciated.