I'm working with Scrapy. I have a pipeline that starts with:
class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self, table):
        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = db[table_name]
I've been reading through https://doc.scrapy.org/en/latest/topics/api.html#crawler-api, which contains:
The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.
but I still do not understand the from_crawler method and the crawler object. What is the relationship between the crawler object and the spider and pipeline objects? How and when is a crawler instantiated? Is a spider a subclass of crawler? I've asked Passing scrapy instance (not class) attribute to pipeline, but I still don't understand how the pieces fit together.
Crawler is actually one of the most important objects in Scrapy's architecture. It is a central piece of the crawling execution logic which "glues" a lot of other pieces together: a crawler (or multiple crawlers) is controlled by a CrawlerRunner or a CrawlerProcess instance.
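To make that relationship concrete, here is a minimal sketch (MySpider is a hypothetical placeholder for your own spider): a CrawlerProcess builds a Crawler object around a spider class, and the spider instance itself only appears once the crawl actually starts. In other words, a spider is not a subclass of Crawler; it is created and owned by one.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "my_spider"  # hypothetical spider standing in for yours

process = CrawlerProcess()

# create_crawler() wraps the spider *class* in a Crawler object;
# the spider *instance* is only created once the crawl actually starts
crawler = process.create_crawler(MySpider)

print(crawler.spidercls)                 # the spider class this crawler will run
print(crawler.settings.get("BOT_NAME"))  # the settings the crawler was built with
print(crawler.spider)                    # None - no spider instance yet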
Now, the from_crawler method, which is available on lots of Scrapy components, is just a way for these components to get access to the crawler instance that is running this particular component.
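Applied to the pipeline in the question, that means from_crawler receives the Crawler that is running the spider and can pull anything it needs off it: crawler.spider (and therefore the table attribute), crawler.settings, crawler.signals, and so on. A rough sketch follows - the DB_PATH setting name is made up purely for illustration; your code builds the path from settings.SETTINGS_PATH instead:

import dataset

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.spider is the spider instance this crawler is running, so
        # arguments passed with "-a table=..." show up here as attributes
        table = getattr(crawler.spider, "table")
        # crawler.settings exposes the crawler's settings; DB_PATH is a
        # hypothetical custom setting used only in this sketch
        db_path = crawler.settings.get("DB_PATH", "sqlite:///data.db")
        return cls(table, db_path)

    def __init__(self, table, db_path):
        db = dataset.connect(db_path)
        self.my_table = db[table[0:3]]  # first 3 letters, as in the question

    def process_item(self, item, spider):
        self.my_table.insert(dict(item))
        return item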
Also, look at the actual implementations of Crawler, CrawlerRunner and CrawlerProcess. And, what I personally found helpful in order to better understand how Scrapy works internally was to run a spider from a script - check out these detailed step-by-step instructions.
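Even a bare-bones script like the sketch below (the spider, URL and table value are placeholders) exercises the whole chain: process.crawl() builds a Crawler for the spider class, the Crawler instantiates the spider with the table argument, and each pipeline's from_crawler is handed that same crawler.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "my_spider"                    # placeholder spider
    start_urls = ["http://example.com"]   # placeholder URL

    def parse(self, response):
        yield {"url": response.url, "table": self.table}

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})

# "table" travels the same way "-a table=..." does on the command line:
# it becomes an attribute of the spider instance, which is exactly what
# the pipeline later reads via crawler.spider.table
process.crawl(MySpider, table="products")
process.start()   # blocks here until the crawl is finished

To actually plug your pipeline in, you would also add ITEM_PIPELINES to the settings dict, pointing at the pipeline's import path.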