Scrapy :: Issues with JSON export

Posted 2019-05-28 12:48

Question:

So, I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have since been plugging away at a very basic crawler. However, I am not able to get the output into a JSON file. I feel like I am missing something obvious, but I haven't been able to turn anything up after looking at a number of other examples, and trying several different things out.

To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and their associated prices. The prices will change fairly often, and the items will change with much lower frequency.

Here is my items.py:

from scrapy.item import Item, Field

class CartItems(Item):
    url = Field()
    name = Field()
    price = Field()

And here is the spider:

from scrapy.selector import HtmlXPathSelector                                                                                                                                        
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

from Example.items import CartItems

class DomainSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/path/to/desired/page']


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cart = CartItems()
        cart['url'] = hxs.select('//title/text()').extract()
        cart['name'] = hxs.select('//td/text()').extract()[1]
        cart['price'] = hxs.select('//td/text()').extract()[2]
        return cart

If for example I run hxs.select('//td/text()').extract()[1] from the Scrapy shell on the URL http://www.example.com/path/to/desired/page, then I get the following response:

u'Text field I am trying to download'

EDIT:

Okay, so I wrote a pipeline that follows one I found in the wiki (I somehow missed this section when I was digging through this the last few days), just altered to use JSON instead of XML.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonItemExporter

class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This does output a file "example.com_items.json", but all it contains is "[]". So, something is still not right here. Is the issue with the spider, or is the pipeline not written correctly? Clearly I am missing something, so if someone could nudge me in the right direction, or link me any examples that might help, that would be most appreciated.
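For reference, a custom pipeline only runs if it is enabled in settings.py. Since the file does get created, this part is presumably already in place, but for anyone following along, a minimal sketch (the module path `Example.pipelines.JsonExportPipeline` is an assumption based on the project name above; older Scrapy versions used a plain list instead of a dict) looks like:

```python
# settings.py -- enable the custom export pipeline.
# "Example.pipelines.JsonExportPipeline" is an assumed module path;
# adjust it to match your actual project layout.
ITEM_PIPELINES = {
    'Example.pipelines.JsonExportPipeline': 300,  # lower numbers run earlier
}
```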

Answer 1:

JsonItemExporter is fairly simple:

class JsonItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("[")

    def finish_exporting(self):
        self.file.write("]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

So, I have two conclusions:

  1. The file is created - your pipeline is active and hooks the spider_opened and spider_closed events.

  2. process_item is never called. Maybe no item is scraped, so no item is passed to this pipeline?

Also, I think there is a bug in the code:

def spider_opened(self, spider):
    file = open('%s_items.json' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = JsonItemExporter(file)
    self.exporter.start_exporting()

self.exporter = JsonItemExporter(file) - doesn't this mean that only one exporter is active at any time? Once a spider is opened you create an exporter, but while that spider is still active another spider can open, and self.exporter will be overwritten by the new exporter.



Answer 2:

I copied your code from JsonExportPipeline and tested on my machine. It works fine with my spider.

So I think you should check the page.

start_urls = ['http://www.example.com/path/to/desired/page']

Maybe something is wrong in how your parse function extracts the content. That is the function below:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    cart = CartItems()
    cart['url'] = hxs.select('//title/text()').extract()
    cart['name'] = hxs.select('//td/text()').extract()[1]
    cart['price'] = hxs.select('//td/text()').extract()[2]
    return cart