How to get the number of requests in the queue in Scrapy

Posted 2019-03-18 14:25

I am using scrapy to crawl some websites. How to get the number of requests in the queue?

I have looked at the Scrapy source code and found that scrapy.core.scheduler.Scheduler may hold the answer. See: https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/scheduler.py

Two questions:

  1. How to access the scheduler in my spider class?
  2. What do self.dqs and self.mqs mean in the scheduler class?

Tags: python scrapy
2 Answers
Evening l夕情丶 · 2019-03-18 15:02

This took me a while to figure out, but here's what I used:

self.crawler.engine.slot.scheduler

That is the scheduler instance. You can then call len() on it, or if you just need a true/false for pending requests, do something like this:

self.crawler.engine.slot.scheduler.has_pending_requests()

Beware that there could still be requests in flight even though the queue is empty. To check how many requests are currently running, use:

len(self.crawler.engine.slot.inprogress)
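Putting those pieces together, here is a minimal sketch of a spider callback that logs all three numbers. The spider name and URL are placeholders, and the engine/slot attributes are Scrapy internals that may change between versions:

import scrapy

class QueueStatsSpider(scrapy.Spider):
    name = "queue_stats"                    # hypothetical spider name
    start_urls = ["http://example.com/"]    # placeholder URL

    def parse(self, response):
        slot = self.crawler.engine.slot
        queued = len(slot.scheduler)                      # requests waiting in the scheduler
        pending = slot.scheduler.has_pending_requests()   # True if the queue is non-empty
        in_progress = len(slot.inprogress)                # requests currently being downloaded
        self.logger.info("queued=%d pending=%s in_progress=%d",
                         queued, pending, in_progress)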
Melony? · 2019-03-18 15:04

An approach to answering your questions:

From the documentation http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

So self.dqs and self.mqs are self-explanatory: they are the scheduler's disk queue and memory queue, respectively.
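If you want to inspect them separately, a hedged sketch from inside a spider callback (per the 0.24 scheduler source linked above, self.dqs is only created when JOBDIR is configured, so it may be None):

scheduler = self.crawler.engine.slot.scheduler
in_memory = len(scheduler.mqs)                          # pending requests held in memory
on_disk = len(scheduler.dqs) if scheduler.dqs else 0    # disk queue exists only with JOBDIR set
total = in_memory + on_disk                             # same value as len(scheduler)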

Another SO answer (Storing scrapy queue in a database) suggests accessing Scrapy's internal queue representation, queuelib: https://github.com/scrapy/queuelib

Once you have the queue, you just need to count its length.
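For illustration, queuelib can also be exercised standalone; a minimal sketch (the on-disk path is arbitrary):

from queuelib import FifoDiskQueue

q = FifoDiskQueue("requests.queue")   # arbitrary path for the queue file
q.push(b"request 1")                  # queuelib stores raw bytes
q.push(b"request 2")
print(len(q))                         # -> 2, the number of items in the queue
q.close()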
