What is the relationship between the crawler object and the spider and pipeline objects?

Posted 2019-04-14 23:51


I'm working with Scrapy. I have a pipeline that starts with:

import dataset  # SQL toolkit providing dataset.connect()

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler exposes the running spider, so here you get whatever
        # value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self, table):
        try:
            # `settings` is the project's settings module, imported elsewhere
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = db[table_name]
        except Exception:
            raise  # the rest of the pipeline is omitted here
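
For illustration, here is a minimal sketch of a spider where table arrives as an instance attribute (for example from a -a command-line argument); the spider name and URL are hypothetical:

import scrapy

class MySpider(scrapy.Spider):
    # Hypothetical spider. Running `scrapy crawl my_spider -a table=products`
    # makes the default Spider.__init__ set self.table = "products".
    name = "my_spider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {"url": response.url, "table": self.table}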

I've been reading through https://doc.scrapy.org/en/latest/topics/api.html#crawler-api, which contains:

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.

but I still do not understand the from_crawler method and the crawler object. What is the relationship between the crawler object and the spider and pipeline objects? How and when is a crawler instantiated? Is a spider a subclass of crawler? I've asked "Passing scrapy instance (not class) attribute to pipeline", but I don't understand how the pieces fit together.

Tags: python scrapy
1 Answer
Deceive 欺骗 · answered 2019-04-15 00:35

Crawler is actually one of the most important objects in Scrapy's architecture. It is a central piece of the crawling execution logic that "glues" a lot of other pieces together:

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.

One or more crawlers are controlled by a CrawlerRunner or CrawlerProcess instance.
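
For example, a Crawler object is created for you each time you call crawl() on a CrawlerProcess. Here is a minimal sketch using the documented API (MySpider is a placeholder spider class like the one in your question):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
# crawl() builds a Crawler around MySpider; extra keyword arguments
# such as table are set as attributes on the spider instance.
process.crawl(MySpider, table="products")
process.start()  # blocks until all crawls are finished

The scrapy crawl command does essentially the same thing internally.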

That from_crawler method, which is available on many Scrapy components, is simply a way for these components to get access to the crawler instance that is running them. And no, a spider is not a subclass of Crawler: the crawler holds a reference to the running spider as crawler.spider, alongside the settings, signals, stats collector and other core components.
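
To make that concrete, here is a minimal sketch of a pipeline that uses from_crawler to reach the components the crawler glues together; BATCH_SIZE and the pipeline name are hypothetical, the rest is the documented crawler API:

from scrapy import signals

class ExamplePipeline(object):
    # Hypothetical pipeline illustrating what the crawler exposes.

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        pipeline.batch_size = crawler.settings.getint("BATCH_SIZE", 100)
        pipeline.stats = crawler.stats  # the stats collector
        # Depending on when a component is built, crawler.spider may not
        # be populated yet, so subscribing to spider_opened is the safe
        # way to read attributes off the running spider.
        crawler.signals.connect(pipeline.spider_opened,
                                signal=signals.spider_opened)
        return pipeline

    def spider_opened(self, spider):
        self.table = getattr(spider, "table", None)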

Also, have a look at the actual implementations of Crawler, CrawlerRunner and CrawlerProcess.

What I personally found helpful for understanding how Scrapy works internally was running a spider from a script - check out these detailed step-by-step instructions.
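
A condensed version of those instructions, following the run-from-a-script example in the Scrapy docs (MySpider is again the placeholder spider class):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()
# crawl() returns a Deferred that fires when the crawl is finished.
d = runner.crawl(MySpider, table="products")
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script blocks here until the crawl ends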
