How can I use different pipelines for different sp

2019-01-10 01:09发布

问题:

I have a scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines i have defined are applicable for every spider.

Thanks

回答1:

Just remove all pipelines from main settings and use this inside spider.

This will define the pipeline to user per spider

class testSpider(InitSpider):
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }


回答2:

Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider for whether or not it should be executed. For example:

def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:

class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item

class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- would be nice to change so that the order could be specified on the Spider, too).



回答3:

I can think of at least four approaches:

  1. Use a different scrapy project per set of spiders+pipelines (might be appropriate if your spiders are different enough warrant being in different projects)
  2. On the scrapy tool command line, change the pipeline setting with scrapy settings in between each invocation of your spider
  3. Isolate your spiders into their own scrapy tool commands, and define the default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you want for that command. See line 6 of this example.
  4. In the pipeline classes themselves, have process_item() check what spider it's running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn't use this one.)


回答4:

You can use the name attribute of the spider in your pipeline

class CustomPipeline(object)

    def process_item(self, item, spider)
         if spider.name == 'spider1':
             # do something
             return item
         return item

Defining all pipelines this way can accomplish what you want.



回答5:

The other solutions given here are good, but I think they could be slow, because we are not really not using the pipeline per spider, instead we are checking if a pipeline exists every time an item is returned (and in some cases this could reach millions).

A good way to completely disable (or enable) a feature per spider is using custom_setting and from_crawler for all extensions like this:

pipelines.py

from scrapy.exceptions import NotConfigured

class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
   'myproject.pipelines.SomePipeline': 300,
}
SOMEPIPELINE_ENABLED = True # you could have the pipeline enabled by default

spider1.py

class Spider1(Spider):

    name = 'spider1'

    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you check, we have specified custom_settings that will override the things specified in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.

Now when you run this spider, check for something like:

[scrapy] INFO: Enabled item pipelines: []

Now scrapy has completely disabled the pipeline, not bothering of its existence for the whole run. Check that this also works for scrapy extensions and middlewares.



回答6:

I am using two pipelines, one for image download (MyImagesPipeline) and second for save data in mongodb (MongoPipeline).

suppose we have many spiders(spider1,spider2,...........),in my example spider1 and spider5 can not use MyImagesPipeline

settings.py

ITEM_PIPELINES = {'scrapycrawler.pipelines.MyImagesPipeline' : 1,'scrapycrawler.pipelines.MongoPipeline' : 2}
IMAGES_STORE = '/var/www/scrapycrawler/dowload'

And bellow complete code of pipeline

import scrapy
import string
import pymongo
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        if spider.name not in ['spider1', 'spider5']:
            return super(ImagesPipeline, self).process_item(item, spider)
        else:
           return item 

    def file_path(self, request, response=None, info=None):
        image_name = string.split(request.url, '/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' +image_name

class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url='snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        #self.db[self.collection_name].insert(dict(item))
        collection_name=item.get( 'collection_name', self.collection_name )
        self.db[collection_name].insert(dict(item))
        data = {}
        data['base_id'] = item['base_id']
        self.db[self.collection_url].update({
            'base_id': item['base_id']
        }, {
            '$set': {
            'image_download': 1
            }
        }, upsert=False, multi=True)
        return item


回答7:

we can use some conditions in pipeline as this

    # -*- coding: utf-8 -*-
from scrapy_app.items import x

class SaveItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, x,):
            item.save()
        return item