Formatting Scrapy's output to XML

Posted 2019-05-01 16:52

Question:

I am scraping a website with Scrapy and want the data in a particular format when I export it to XML.

Here is what I would like my XML to look like:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <row>
    <field1><![CDATA[Data Here]]></field1>
    <field2><![CDATA[Data Here]]></field2>
  </row>
</data>

I am running my scrape by using the command:

$ scrapy crawl my_scrap -o items.xml -t xml

The current output I am getting is along the lines of:

<?xml version="1.0" encoding="utf-8"?>
<items><item><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></item>

As you can see, it adds <value> elements inside each field, and I am not able to rename the root node or the item nodes. I know that I need to use XmlItemExporter, but I am not sure how to implement it in my project.

I have tried adding it to pipelines.py as shown in the documentation, but I always end up with the error:

AttributeError: 'CrawlerProcess' object has no attribute 'signals'

Does anybody know of examples of how to reformat the data when exporting it to XML using the XmlItemExporter?

Edit:

Showing my XmlItemExporter in my pipelines.py module:

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Edit (Showing modifications and Traceback):

I modified the spider_opened function:

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, 'data', 'row')
        self.exporter.start_exporting()

The traceback I get is:

Traceback (most recent call last):
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
            spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
            return signal.send_catch_log_deferred(*a, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
            *arguments, **named)
        --- <exception caught here> ---
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
            result = f(*args, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
            return receiver(*arguments, **named)
          File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
            self.exporter.finish_exporting()
        exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'

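Answer 1:

You can make XmlItemExporter do most of what you want simply by supplying the names of the nodes you want. Pass them by keyword so the root and item names cannot be swapped; the documented signature is XmlItemExporter(file, item_element='item', root_element='items', **kwargs):

XmlItemExporter(file, item_element='row', root_element='data')

See the XmlItemExporter documentation.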

The <value> elements appear in your fields because those fields are not scalar values. When XmlItemExporter encounters a scalar value, it simply outputs <fieldname>data</fieldname>, but when it encounters an iterable value, it serializes each entry as a nested element: <fieldname><value>data1</value><value>data2</value></fieldname>. The solution is to store scalar values in your item fields, for example a single string rather than the list that a selector's extract() call returns.
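One way to stop emitting non-scalar values is to flatten each item before it reaches the exporter. The sketch below is only an illustration: `flatten_value` and `flatten_item` are hypothetical helper names, not part of Scrapy, and joining list entries with a space is an assumption about what you want in the output.

```python
def flatten_value(value, separator=" "):
    """Turn a list or tuple into one string; return scalars unchanged."""
    if isinstance(value, (list, tuple)):
        # Strip each entry so stray whitespace from the page is not kept.
        return separator.join(str(entry).strip() for entry in value)
    return value


def flatten_item(item):
    """Return a plain dict whose field values are all scalars."""
    return {name: flatten_value(value) for name, value in item.items()}


# Example item shaped like the output of list-returning selector calls.
scraped = {"field1": ["Data Here"], "field2": [" a ", "b"], "field3": 3}
flat = flatten_item(scraped)
```

A helper like this could be called from the spider before yielding the item, or from a pipeline's process_item before the item is handed to the exporter.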

If you cannot do that, subclass XmlItemExporter and override its _export_xml_field method so that it handles iterable values the way you want. The XmlItemExporter source code shows the default implementation.
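The override approach can be sketched without Scrapy installed. In the example below, `MiniXmlExporter` is a hypothetical stand-in that only mimics the scalar-versus-iterable rule described above using the standard library's XMLGenerator; it is not Scrapy's class, and the real `_export_xml_field` signature may differ between Scrapy versions, so check the source of the version you run before copying the pattern.

```python
import io
from xml.sax.saxutils import XMLGenerator


class MiniXmlExporter:
    """Hypothetical stand-in for an XML item exporter (not Scrapy's class)."""

    def __init__(self, stream, item_element="item", root_element="items"):
        self.item_element = item_element
        self.root_element = root_element
        self.xg = XMLGenerator(stream, encoding="utf-8")

    def start_exporting(self):
        self.xg.startDocument()
        self.xg.startElement(self.root_element, {})

    def export_item(self, item):
        self.xg.startElement(self.item_element, {})
        for name, value in item.items():
            self._export_xml_field(name, value)
        self.xg.endElement(self.item_element)

    def finish_exporting(self):
        self.xg.endElement(self.root_element)
        self.xg.endDocument()

    def _export_xml_field(self, name, value):
        # Default rule: lists and tuples become nested <value> elements.
        self.xg.startElement(name, {})
        if isinstance(value, (list, tuple)):
            for entry in value:
                self._export_xml_field("value", entry)
        else:
            self.xg.characters(str(value))
        self.xg.endElement(name)


class FlatFieldExporter(MiniXmlExporter):
    """Override the field hook so iterables are written as one scalar."""

    def _export_xml_field(self, name, value):
        if isinstance(value, (list, tuple)):
            value = " ".join(str(entry) for entry in value)
        super()._export_xml_field(name, value)


def export(exporter_cls, items):
    """Run a full export into memory and return the XML text."""
    stream = io.StringIO()
    exporter = exporter_cls(stream, item_element="row", root_element="data")
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
    return stream.getvalue()


items = [{"field1": ["Data Here"], "field2": ["a", "b"]}]
default_xml = export(MiniXmlExporter, items)
flat_xml = export(FlatFieldExporter, items)
```

Here `default_xml` contains nested <value> elements while `flat_xml` does not. A real subclass would extend XmlItemExporter in the same way, overriding only the field method and leaving the rest of the exporter untouched.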