How can Scrapy export items to separate CSV files?

Posted 2019-01-17 04:48

I am scraping a soccer site, and a single spider gets several kinds of items from the site's pages: Team, Match, Club, etc. I am trying to use the CsvItemExporter to store these items in separate csv files: teams.csv, matches.csv, clubs.csv, etc.

I am not sure what the right way to do this is. The only approach I have come up with so far is to write my own custom pipeline, like the example at http://doc.scrapy.org/en/0.14/topics/exporters.html, that opens all the needed csv files in the spider_opened method (i.e. creates one csv exporter per csv file) and, in process_item, works out what kind of item the "item" parameter is and sends it to the corresponding exporter object.

Anyway, I haven't found any examples of handling multiple csv files (one per item type) in Scrapy, so I am worried that I am using it in a way it was not meant to be used. (This is my first experience with Scrapy.)

diomedes

2 answers
手持菜刀,她持情操
Reply #2 · 2019-01-17 05:10

Your approach seems fine to me. Pipelines are a great feature of Scrapy and are, IMO, built for exactly this kind of use.

You could create multiple item classes (e.g. SoccerItem, MatchItem) and have your MultiCSVItemPipeline delegate each item to its own CSV exporter by checking the item's class.
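To make the delegation idea concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy's exporters, so it can be run on its own. The names TeamItem, MatchItem and CsvRouter are hypothetical and exist only for illustration; in a real project the same lookup-by-class would sit inside the pipeline's process_item.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class TeamItem:  # hypothetical item type, for illustration only
    name: str
    country: str


@dataclass
class MatchItem:  # hypothetical item type, for illustration only
    home: str
    away: str
    score: str


class CsvRouter:
    """Route each item to a csv.DictWriter chosen by the item's class."""

    def __init__(self, streams):
        # streams maps an item class to an open text stream
        self.writers = {}
        for item_cls, stream in streams.items():
            names = [f.name for f in fields(item_cls)]
            writer = csv.DictWriter(stream, fieldnames=names)
            writer.writeheader()
            self.writers[item_cls] = writer

    def process_item(self, item):
        writer = self.writers.get(type(item))
        if writer is not None:  # unknown item types pass through unwritten
            writer.writerow(asdict(item))
        return item


# In-memory streams stand in for teams.csv and matches.csv
team_out, match_out = io.StringIO(), io.StringIO()
router = CsvRouter({TeamItem: team_out, MatchItem: match_out})
router.process_item(TeamItem("Ajax", "NL"))
router.process_item(MatchItem("Ajax", "PSV", "2-1"))
```

Keying the lookup on the class itself avoids any string manipulation of class names, at the cost of having to list every item class up front.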

聊天终结者
Reply #3 · 2019-01-17 05:18

I am posting here the code I used to produce a MultiCSVItemPipeline based on the answer of drcolossos above.

This pipeline assumes that all the Item classes follow the convention *Item (e.g. TeamItem, EventItem) and creates team.csv, event.csv files and sends all records to the appropriate csv files.

from scrapy import signals
from scrapy.exporters import CsvItemExporter

# Output directory prefix (include the trailing separator). The original post
# used CSVDir without defining it; the empty string here is a placeholder.
CSVDir = ''


def item_type(item):
    return type(item).__name__.replace('Item', '').lower()  # TeamItem => team


class MultiCSVItemPipeline(object):
    SaveTypes = ['team', 'club', 'event', 'match']

    @classmethod
    def from_crawler(cls, crawler):
        # Connect via the crawler's signal manager; scrapy.xlib.pydispatch is deprecated
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One binary file and one exporter per item type
        self.files = {name: open(CSVDir + name + '.csv', 'w+b') for name in self.SaveTypes}
        self.exporters = {name: CsvItemExporter(f) for name, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        what = item_type(item)
        if what in self.exporters:  # items of other types pass through unexported
            self.exporters[what].export_item(item)
        return item
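For the pipeline to run at all, it has to be enabled in the project's settings. A minimal sketch, assuming the class lives in a module importable as myproject.pipelines (a hypothetical path; substitute your own project's):

```python
# settings.py (sketch): 'myproject.pipelines' is a hypothetical module path
ITEM_PIPELINES = {
    "myproject.pipelines.MultiCSVItemPipeline": 300,
}
```

The integer sets the order relative to any other enabled pipelines: lower values run first, and values are conventionally kept in the 0-1000 range.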