I want to fetch web pages under different domains, which means I have to use different spiders with the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the pages differs. But every spider has to go through all of the pipelines defined in settings.py. Is there a more elegant way to use separate pipelines for each spider?
A more robust solution; I can't remember where I found it, but a Scrapy dev proposed it somewhere. Using this method lets you have a pipeline run on all spiders simply by not applying the wrapper to it. It also means you don't have to duplicate the logic of checking whether or not to use the pipeline.
Wrapper:
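The original answer's code did not survive extraction, so the following is a minimal sketch of the decorator pattern it describes. The name check_spider_pipeline and the spider-level pipeline attribute are conventions of this pattern, not Scrapy built-ins:

```python
import functools


def check_spider_pipeline(process_item_method):
    """Only run the wrapped process_item if this pipeline class is listed
    in the spider's `pipeline` attribute; otherwise pass the item through."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in getattr(spider, 'pipeline', set()):
            spider.logger.debug('%s: executing pipeline step',
                                self.__class__.__name__)
            return process_item_method(self, item, spider)
        spider.logger.debug('%s: skipping pipeline step',
                            self.__class__.__name__)
        return item
    return wrapper
```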
Usage:
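A sketch of a pipeline that opts in to the check; the class name is hypothetical, and the pipeline is still registered in ITEM_PIPELINES in settings.py as usual:

```python
class SaveToDatabasePipeline:
    @check_spider_pipeline
    def process_item(self, item, spider):
        # database logic specific to the spiders that enable this pipeline
        return item
```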
Spider usage:
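A sketch of the spider side, which lists the pipeline classes that should actually run on its items (the module path and names are assumptions for illustration):

```python
import scrapy

from myproject.pipelines import SaveToDatabasePipeline  # hypothetical path


class MySpider(scrapy.Spider):
    name = 'myspider'
    # only these pipeline classes will process items from this spider
    pipeline = {SaveToDatabasePipeline}
```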
The ITEM_PIPELINES setting is defined globally for all spiders in the project during engine start. It cannot be changed per spider on the fly. Here are some options to consider:
- Change the code of your pipelines. Skip or continue processing items returned by spiders in the process_item method of your pipeline, based on the spider name (see the first sketch below).
- Change the way you start crawling. Do it from a script, based on a spider name passed as a parameter, and override your ITEM_PIPELINES setting before calling crawler.configure() (see the second sketch below).
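The inline example was lost in extraction; a minimal sketch of the first option, filtering on spider.name inside process_item, might look like this (spider and pipeline names are placeholders):

```python
class Spider1Pipeline:
    def process_item(self, item, spider):
        # pass items through untouched for spiders this pipeline is not meant for
        if spider.name not in ('spider1', 'spider3'):
            return item
        # ... processing specific to spider1 / spider3 items ...
        return item
```

For the second option, crawler.configure() comes from older Scrapy versions; with a current release the same idea, choosing pipelines per spider name before starting the crawl from a script, could be sketched roughly like this (module paths are assumptions):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical mapping of spider names to the pipelines they should use
PIPELINES_BY_SPIDER = {
    'spider1': {'myproject.pipelines.Spider1Pipeline': 300},
    'spider2': {'myproject.pipelines.Spider2Pipeline': 300},
}


def run(spider_name):
    settings = get_project_settings()
    settings.set('ITEM_PIPELINES', PIPELINES_BY_SPIDER[spider_name])
    process = CrawlerProcess(settings)
    process.crawl(spider_name)
    process.start()  # blocks until the crawl finishes
```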
Hope that helps.
A slightly better version of the above is as follows. It is better because it lets you selectively turn pipelines on for different spiders more easily than hard-coding checks like "not in ['spider1', 'spider2']" inside the pipeline, as in the answer above.
In your spider class, add:
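The answer's code block was lost in extraction; the idea is to give the spider one boolean attribute per pipeline it should pass through. A sketch with hypothetical attribute names:

```python
import scrapy


class Spider1(scrapy.Spider):
    name = 'spider1'
    # one flag per pipeline this spider should actually use
    use_database_pipeline = True
    use_image_pipeline = False
```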
Then in each pipeline, you can use getattr as magic. Add:
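A sketch of that getattr check (the attribute name must match what the spider defines; the False default means spiders that don't declare the flag are skipped):

```python
class DatabasePipeline:
    def process_item(self, item, spider):
        # skip spiders that have not opted in to this pipeline
        if not getattr(spider, 'use_database_pipeline', False):
            return item
        # ... database-specific processing here ...
        return item
```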