I am using Scrapy to crawl some websites. How can I get the number of requests in the queue?
I have looked at the Scrapy source code and found that `scrapy.core.scheduler.Scheduler` may lead to my answer. See: https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/scheduler.py
Two questions:
- How do I access the scheduler in my spider class?
- What do `self.dqs` and `self.mqs` mean in the scheduler class?
This took me a while to figure out, but here's what I used:
`self.crawler.engine.slot.scheduler`

That is the instance of the scheduler. You can then call its `__len__()` method, or, if you just need a true/false answer for pending requests, use the scheduler's `has_pending_requests()` method (see the sketch below). Beware that there could still be running requests even though the queue is empty; to check how many requests are currently in flight, look at the engine slot's `inprogress` set (also in the sketch below).
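A minimal sketch combining these, assuming it runs inside a spider callback against Scrapy 0.24-era internals (`crawler.engine.slot` is not a stable public API, so attribute names may differ in other versions):

```python
def parse(self, response):
    slot = self.crawler.engine.slot
    scheduler = slot.scheduler

    # Total number of requests still enqueued (memory + disk queues).
    queued = len(scheduler)

    # True/False: is anything still waiting in the queue?
    pending = scheduler.has_pending_requests()

    # The queue can be empty while requests are still in flight;
    # the engine slot tracks those in its `inprogress` set.
    in_flight = len(slot.inprogress)

    self.log("queued=%d pending=%s in-flight=%d"
             % (queued, pending, in_flight))
```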
An approach to answering your questions:
From the documentation (http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order), `self.dqs` and `self.mqs` are self-explanatory: the disk queue scheduler and the memory queue scheduler (a sketch of counting each is below).

Another SO answer (Storing scrapy queue in a database) suggests accessing Scrapy's internal queue representation, queuelib: https://github.com/scrapy/queuelib. Once you get hold of it, you just need to count the length of the queue.
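Two sketches of that idea, assuming the Scrapy 0.24 internals linked in the question. First, counting each scheduler queue from inside a spider callback (`dqs` is `None` unless a job directory is configured):

```python
scheduler = self.crawler.engine.slot.scheduler

in_memory = len(scheduler.mqs)                        # memory queue scheduler
on_disk = len(scheduler.dqs) if scheduler.dqs else 0  # disk queue (JOBDIR only)
total = in_memory + on_disk                           # this is what len(scheduler) returns
```

Second, queuelib queues used standalone support `len()` directly (the path here is made up):

```python
from queuelib import FifoDiskQueue

q = FifoDiskQueue("somequeue")  # queuelib stores the queue files at this path
q.push(b"a-serialized-request")
print(len(q))  # -> 1
q.close()
```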