I have a Scrapy project that contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable to every spider.
Thanks
Just remove all pipelines from the main settings and use custom_settings inside the spider. This will define the pipelines to use per spider:
class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
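To illustrate the "per spider" part, here is a minimal sketch of a second spider in the same project declaring a different pipeline set in its own custom_settings (the spider name and pipeline path are hypothetical, not from the original answer):

class otherSpider(InitSpider):
    name = 'other'

    custom_settings = {
        'ITEM_PIPELINES': {
            # hypothetical pipeline path; only this spider will run it
            'app.AnotherPipeline': 400
        }
    }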
Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider for whether or not it should be executed. For example:
import functools

from scrapy import log  # legacy Scrapy log module, provides log.DEBUG


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper
For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:
from scrapy.spider import BaseSpider

import pipelines  # your project's pipelines module


class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item
And then in a pipelines.py file:
class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item
All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- would be nice to change so that the order could be specified on the Spider, too).
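For reference, a minimal sketch of that registration in settings.py, assuming the pipelines live in myproject/pipelines.py (the module path and the priority numbers are assumptions, not part of the original answer):

ITEM_PIPELINES = {
    # both pipelines stay registered globally; the decorator decides per
    # spider whether each step actually runs or just passes the item through
    'myproject.pipelines.Validate': 100,
    'myproject.pipelines.Save': 200,
}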
I can think of several approaches:
- Use the scrapy settings command in between each invocation of your spider to change the pipeline setting.
- Isolate your spiders into their own Scrapy tool commands, and set default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you want for that command. See line 6 of this example (a sketch of such a command class follows this list).
- In the pipeline classes themselves, have process_item() check which spider it is running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn't use this one.)
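Here is a minimal sketch of the custom-command approach. The command module, spider command name, and pipeline path are assumptions; Scrapy discovers custom commands through the COMMANDS_MODULE setting, and default_settings on a command class is merged into the project settings when that command runs:

# myproject/commands/crawl_with_save.py  (hypothetical module; enable it with
# COMMANDS_MODULE = 'myproject.commands' in settings.py)
from scrapy.commands.crawl import Command as CrawlCommand


class Command(CrawlCommand):
    """Run `scrapy crawl_with_save <spider>` to crawl with only these pipelines."""

    default_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.Save': 300,  # assumed pipeline path
        },
    }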
You can use the name attribute of the spider in your pipeline:
class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item
Defining all pipelines this way can accomplish what you want.
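A small variation on the same idea (with hypothetical names, not from the original answer) keeps the set of spiders a pipeline applies to in one class attribute, so the check does not have to be repeated for every spider name:

class MongoSavePipeline(object):
    # only these spiders should be processed by this pipeline (assumed names)
    allowed_spiders = {'spider1', 'spider3'}

    def process_item(self, item, spider):
        if spider.name not in self.allowed_spiders:
            return item  # pass the item through untouched
        # do the actual work for the spiders that opted in
        return item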
The other solutions given here are good, but I think they could be slow, because we are not really using a pipeline per spider; instead we are checking whether a pipeline applies every time an item is returned (and in some cases this could reach millions).
A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler, which works for all extensions, like this:
pipelines.py
from scrapy.exceptions import NotConfigured


class SomePipeline(object):

    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item
settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SomePipeline': 300,
}

SOMEPIPELINE_ENABLED = True  # you could have the pipeline enabled by default
spider1.py
class Spider1(Spider):
    name = 'spider1'
    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }
As you can see, we have specified custom_settings, which will override the values specified in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.
Now when you run this spider, check for something like:
[scrapy] INFO: Enabled item pipelines: []
Now Scrapy has completely disabled the pipeline, ignoring its existence for the whole run. Note that this also works for Scrapy extensions and middlewares.
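As a sketch of the same pattern applied to a downloader middleware (the class and setting names here are assumptions, not from the original answer), the from_crawler/NotConfigured combination works identically:

from scrapy.exceptions import NotConfigured


class SomeDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # same trick as the pipeline: a spider can switch this off via
        # custom_settings = {'SOMEMIDDLEWARE_ENABLED': False}
        if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            raise NotConfigured
        return cls()

    def process_request(self, request, spider):
        # tweak the request here; returning None lets processing continue
        return None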
I am using two pipelines, one for image downloads (MyImagesPipeline) and a second for saving data in MongoDB (MongoPipeline). Suppose we have many spiders (spider1, spider2, ...); in my example, spider1 and spider5 cannot use MyImagesPipeline.
settings.py
ITEM_PIPELINES = {
    'scrapycrawler.pipelines.MyImagesPipeline': 1,
    'scrapycrawler.pipelines.MongoPipeline': 2,
}

IMAGES_STORE = '/var/www/scrapycrawler/dowload'
And below is the complete code of the pipelines:
import scrapy
import string

import pymongo
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    def process_item(self, item, spider):
        # skip image downloading for the spiders that should not use this pipeline
        if spider.name not in ['spider1', 'spider5']:
            return super(MyImagesPipeline, self).process_item(item, spider)
        else:
            return item

    def file_path(self, request, response=None, info=None):
        # shard images into sub-directories named after the first two
        # characters of the image file name
        image_name = string.split(request.url, '/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' + image_name
class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url = 'snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].insert(dict(item))
        collection_name = item.get('collection_name', self.collection_name)
        self.db[collection_name].insert(dict(item))
        data = {}
        data['base_id'] = item['base_id']
        self.db[self.collection_url].update({
            'base_id': item['base_id']
        }, {
            '$set': {
                'image_download': 1
            }
        }, upsert=False, multi=True)
        return item
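For reference, a hypothetical item definition that would fit these pipelines: base_id and collection_name come from the process_item code above, while image_urls and images are the standard fields the images pipeline expects; the class name is assumed, not from the original answer.

import scrapy


class ProductItem(scrapy.Item):
    base_id = scrapy.Field()
    collection_name = scrapy.Field()  # optional: overrides the default MongoDB collection
    image_urls = scrapy.Field()       # downloaded by MyImagesPipeline
    images = scrapy.Field()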
We can use some conditions in the pipeline, like this:
# -*- coding: utf-8 -*-
from scrapy_app.items import x


class SaveItemPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, x):
            item.save()
        return item
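Extending that idea (with a hypothetical second item class y that is not in the original answer), each item type can get its own pipeline, so a spider effectively selects pipelines through the item classes it yields:

from scrapy_app.items import y  # assumed to exist alongside x


class SaveOtherItemPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, y):
            item.save()
        return item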