How to get the pipeline object in Scrapy spider

I use MongoDB to store the data from my crawl.

Now I want to query the date of the most recent stored record, so that I can resume crawling from there instead of restarting from the beginning of the URL list. (Each URL is determined by a date, e.g. /2014-03-22.html.)

I want a single connection object to handle all database operations, and that object lives in the pipeline.

So how can I get the existing pipeline object (not a new one) from within the spider?

Or is there a better solution for incremental updates?

Thanks in advance.

Sorry for my poor English. Here is a sample:

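# This is my pipeline
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection was removed in PyMongo 3; MongoClient replaces it.
        # `settings` is the project's Scrapy settings object (its import is not shown).
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....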

And the spider:

class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # I want the existing pipeline object here
        mongo = MongoDBPipeline() # doing it this way creates a new pipeline object
        mongo.get_date()          # Scrapy must already hold a pipeline object for this spider;
                                  # I want that one, the one created when Scrapy started.

OK, I just don't want to create a second object. I admit I'm a bit OCD about it.

Tags: python mongodb scrapy

2 Answers

迷人小祖宗

Post #2 · 2019-02-25 00:04

A Scrapy item pipeline can define an open_spider method, which Scrapy calls when the spider is opened, after the spider object has been initialized. In it you can pass the spider a reference to the database connection, to the get_date() method, or to the pipeline itself. An example of the latter with your code:

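# This is my pipeline
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection was removed in PyMongo 3; MongoClient replaces it.
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....

    def open_spider(self, spider):
        # Called by Scrapy when the spider is opened: give the spider this pipeline instance.
        spider.myPipeline = self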

Then, in the spider:

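class Spider(Spider):
    name = "test"

    def __init__(self, *args, **kwargs):
        # Call the base initializer so Scrapy's own spider setup still runs.
        super().__init__(*args, **kwargs)
        self.myPipeline = None  # overwritten by the pipeline's open_spider

    def parse(self, response):
        self.myPipeline.get_date()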

I don't think the __init__() method is necessary here; I included it only to show that open_spider overwrites the None placeholder once the spider has been initialized.
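If the spider only needs the date, the pipeline can hand over just the bound get_date method instead of itself. The following is a minimal sketch of that variant, not a drop-in implementation: DemoSpider is a plain stand-in for a scrapy.Spider subclass, the last_date argument and the get_last_date and resume_point names are hypothetical, and open_spider is called by hand where Scrapy would normally call it.

```python
class MongoDBPipeline:
    """Trimmed-down pipeline; the real one would hold the pymongo connection."""

    def __init__(self, last_date="2014-03-22"):
        # Hypothetical stand-in for the value get_date() would query from MongoDB.
        self._last_date = last_date

    def get_date(self):
        return self._last_date

    def open_spider(self, spider):
        # Give the spider only the bound method, not the whole pipeline.
        spider.get_last_date = self.get_date


class DemoSpider:
    """Stand-in for a scrapy.Spider subclass so the sketch runs without Scrapy."""

    name = "test"
    get_last_date = None  # filled in by the pipeline's open_spider

    def resume_point(self):
        # Uses the pipeline's existing state; no second pipeline is created.
        return self.get_last_date()


# Scrapy calls open_spider(spider) itself; it is invoked by hand here.
pipeline = MongoDBPipeline()
spider = DemoSpider()
pipeline.open_spider(spider)
print(spider.resume_point())
```

Because the method stays bound to the original pipeline instance, the spider still goes through the single connection that the pipeline owns.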


Summer. ? 凉城

Post #3 · 2019-02-25 00:11

According to the Scrapy Architecture Overview:

The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.

In other words, the spiders run first and the items they extract then flow on to the pipelines; nothing flows back from a pipeline to a spider.

One possible solution is to check, in the pipeline itself, whether the item you've scraped is already in the database.
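A rough sketch of that pipeline-side check follows. It makes several assumptions: items are dicts keyed by a "url" field, InMemoryCollection is a hypothetical stand-in that mimics only the find_one and insert_one methods of a pymongo collection, and DuplicateItem stands in for scrapy.exceptions.DropItem so the snippet runs without Scrapy or a MongoDB server.

```python
class InMemoryCollection:
    """Tiny stand-in exposing the two pymongo Collection methods used below."""

    def __init__(self):
        self._docs = []

    def find_one(self, query):
        # Return the first stored document matching every key in the query.
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self._docs.append(dict(doc))


class DuplicateItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class DedupPipeline:
    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Using "url" as the unique key is an assumption; pick whatever identifies an item.
        if self.collection.find_one({"url": item["url"]}) is not None:
            raise DuplicateItem("already stored: %s" % item["url"])
        self.collection.insert_one(item)
        return item


dedup = DedupPipeline(InMemoryCollection())
first = dedup.process_item({"url": "/2014-03-22.html"}, spider=None)
try:
    dedup.process_item({"url": "/2014-03-22.html"}, spider=None)
    dropped = False
except DuplicateItem:
    dropped = True
```

This keeps the spider unaware of the database, at the cost of still downloading pages whose items are then discarded.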

Another workaround is to keep a list of the URLs you've crawled in the database and, in the spider, check whether you've already fetched the data from a given URL.
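If the URLs are dated as in the question (/2014-03-22.html), the check may reduce to remembering only the last crawled date and generating the URLs that come after it. The helper below is a sketch under that assumption; urls_to_crawl and its parameters are made-up names, and in a real spider last_crawled would come from the database.

```python
from datetime import date, timedelta


def urls_to_crawl(last_crawled, today, template="/{:%Y-%m-%d}.html"):
    """Yield one URL per day after last_crawled, up to and including today."""
    day = last_crawled + timedelta(days=1)
    while day <= today:
        yield template.format(day)
        day += timedelta(days=1)


pending = list(urls_to_crawl(date(2014, 3, 22), date(2014, 3, 25)))
print(pending)
# ['/2014-03-23.html', '/2014-03-24.html', '/2014-03-25.html']
```

A spider could yield requests for these URLs when it starts, so pages from earlier dates are never requested again.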

Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything more specific.

Hope at least this information helped.

