Many of my views fetch external resources. I want to make sure that under heavy load I don't blow up the remote sites (and/or get banned).
I only have 1 crawler so having a central lock will work fine.
So the details: I want to allow at most 3 queries to a host per second, and have the rest block for a maximum of 15 seconds. How could I do this (easily)?
Some thoughts:
- Use django cache
  - Seems to only have 1 second resolution
- Use a file based semaphore
  - Easy to do locks for concurrency. Not sure how to make sure only 3 fetches happen a second.
- Use some shared memory state
  - I'd rather not install more things, but will if I have to.
What about using a different process to handle scraping, and a queue for the communication between it and Django?
This way you would be able to easily change the number of concurrent requests, and it would also automatically keep track of the requests, without blocking the caller.
Most of all, I think it would help lower the complexity of the main application (in Django).
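A minimal sketch of that idea, using a worker thread and a standard-library queue in place of a separate process so that the example stays self-contained. The names `fetch_worker`, `fake_fetch`, and `MIN_INTERVAL` are made up for illustration, and a single worker that sleeps will also delay requests to other hosts, so treat this as a starting point rather than a finished design.

```python
import queue
import threading
import time

MIN_INTERVAL = 1.0 / 3  # seconds between requests to one host (3 per second)


def fetch_worker(jobs, fetch, clock=time.monotonic, sleep=time.sleep):
    """Run (host, url, reply_queue) jobs, spacing out requests per host."""
    next_allowed = {}  # host -> earliest time the next request may start
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop the worker
            break
        host, url, reply = job
        now = clock()
        start = max(now, next_allowed.get(host, now))
        if start > now:
            sleep(start - now)
        next_allowed[host] = start + MIN_INTERVAL
        reply.put(fetch(url))


def fake_fetch(url):
    """Hypothetical stand-in for the real HTTP request."""
    return "fetched " + url


jobs = queue.Queue()
worker = threading.Thread(target=fetch_worker, args=(jobs, fake_fetch), daemon=True)
worker.start()

# A view would enqueue its request and wait for the answer.
reply = queue.Queue()
jobs.put(("example.com", "http://example.com/a", reply))
result = reply.get(timeout=15)  # give up after 15 seconds, as in the question
jobs.put(None)
```

With a real separate process, the two `queue.Queue` objects would be replaced by whatever message queue sits between Django and the scraper.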
One approach: create a table, called Queries here, with one row per query holding the site, a start_time, and a finished flag.
This records when each query has either taken place, or will take place in the future if the limiting prevents it from happening immediately. start_time is the time the action is to start; this is in the future if the action is currently blocking.
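As a rough sketch of such a table, here is one way to create it with Python's built-in sqlite3 module. The column names site, start_time, and finished follow the ones used in this answer; the lowercase table name `queries`, the column types, and the index are assumptions. In a Django project the same three fields would go on a model instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE queries (
        id         INTEGER PRIMARY KEY,
        site       TEXT    NOT NULL,
        start_time REAL,                -- Unix timestamp; may lie in the future
        finished   INTEGER NOT NULL DEFAULT 0
    )
    """
)
# Assumed index: every scheduling step looks up the latest start_time per site.
conn.execute("CREATE INDEX queries_site ON queries (site, start_time)")
```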
Instead of thinking in terms of queries per second, let's think in terms of seconds per query; in this case, 1/3 second per query.
Whenever an action is to be performed, do the following:
- Create a row for the new query.
- In a single atomic action, set its start_time to the greatest start_time for this site plus 1/3 second. If the greatest is 10 seconds in the future, then we can start our action at 10 1/3 seconds. If that time is in the past, clamp it to now().
The atomic action is what's important. You can't simply do an aggregate on Queries and then save it, since it'll race. I don't know if Django can do this natively, but it's easy enough in raw SQL.
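The scheduling step could be sketched like this, using sqlite3 so that it runs without a database server. The table layout, the helper name `reserve_slot`, and the use of Unix timestamps are assumptions, not the original SQL. SQLite serializes writers, so the single UPDATE is safe here; whether one statement is enough on another database depends on its isolation behaviour, so check that before relying on it.

```python
import sqlite3
import time

INTERVAL = 1.0 / 3  # seconds per query, i.e. at most 3 queries per second

# isolation_level=None puts sqlite3 in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE queries (id INTEGER PRIMARY KEY, site TEXT NOT NULL,"
    " start_time REAL, finished INTEGER NOT NULL DEFAULT 0)"
)


def reserve_slot(conn, site, now=None):
    """Create a row for a new query and return (row id, scheduled start_time)."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "INSERT INTO queries (site, start_time) VALUES (?, NULL)", (site,)
    )
    row_id = cur.lastrowid
    # One statement both reads the latest slot for the site and claims the next.
    # COALESCE covers the first query for a site; the outer MAX clamps to now.
    conn.execute(
        """
        UPDATE queries
           SET start_time = MAX(
                   COALESCE(
                       (SELECT MAX(start_time) FROM queries WHERE site = ?) + ?,
                       ?),
                   ?)
         WHERE id = ?
        """,
        (site, INTERVAL, now, now, row_id),
    )
    (start_time,) = conn.execute(
        "SELECT start_time FROM queries WHERE id = ?", (row_id,)
    ).fetchone()
    return row_id, start_time


first = reserve_slot(conn, "example.com", now=100.0)   # starts immediately
second = reserve_slot(conn, "example.com", now=100.0)  # starts 1/3 second later
```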
Then, reload the model and sleep if necessary. You'll also need to purge old rows. Something like Queries.objects.filter(site=site, finished=True).exclude(id=id).delete() will probably work: delete all finished queries except the one you just made. (That way, you never delete the latest query, since later queries need that to be scheduled.)
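A sketch of the waiting and purging steps, again against a sqlite3 table with assumed column names. `wait_for_slot`, `purge_finished`, and `RateLimitTimeout` are hypothetical helpers; the 15 second limit comes from the question. Note that a caller who gives up this way still leaves its slot reserved.

```python
import sqlite3
import time

MAX_WAIT = 15.0  # longest a caller may block, taken from the question


class RateLimitTimeout(Exception):
    """The reserved slot is more than MAX_WAIT seconds away."""


def wait_for_slot(start_time, now=None, sleep=time.sleep):
    """Sleep until start_time and return the number of seconds slept."""
    now = time.time() if now is None else now
    delay = start_time - now
    if delay > MAX_WAIT:
        raise RateLimitTimeout("slot is %.1f seconds away" % delay)
    if delay > 0:
        sleep(delay)
    return max(delay, 0.0)


def purge_finished(conn, site, keep_id):
    """Delete finished rows for a site, except the row that was just created."""
    conn.execute(
        "DELETE FROM queries WHERE site = ? AND finished = 1 AND id != ?",
        (site, keep_id),
    )


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE queries (id INTEGER PRIMARY KEY, site TEXT NOT NULL,"
    " start_time REAL, finished INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO queries (site, start_time, finished) VALUES (?, ?, ?)",
    [("example.com", 1.0, 1), ("example.com", 2.0, 1), ("example.com", 3.0, 0)],
)
purge_finished(conn, "example.com", keep_id=2)
remaining = [row[0] for row in conn.execute("SELECT id FROM queries ORDER BY id")]
```

Here row 1 is finished and gets deleted, row 2 is finished but kept because it is the one passed as `keep_id`, and row 3 is still pending.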
Finally, make sure the UPDATE doesn't take place in a transaction. Autocommit must be turned on for this to work. Otherwise, the UPDATE won't be atomic: it'd be possible for two requests to UPDATE at the same time, and receive the same result. Django and Python typically have autocommit off, so you need to turn it on and then back off. With Postgres, this is connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT) to turn it on and connection.set_isolation_level(ISOLATION_LEVEL_READ_COMMITTED) to turn it back off. I don't know how to do this with MySQL.
(I consider the default of having autocommit turned off in Python's DB-API to be a serious design flaw.)
The benefit of this approach is that it's quite simple, with straightforward state; you don't need things like event listeners and wakeups, which have their own sets of problems.
A possible issue is that if the user cancels the request during the delay, whether or not you do the action, the delay is still enforced. If you never start the action, other requests won't move down into the unused "timeslot".
If you're not able to get autocommit to work, a workaround would be to add a UNIQUE constraint to (site, start_time). (I don't think Django understands that directly, so you'd need to add the constraint yourself.) Then, if the race happens and two requests to the same site end up at the same time, one of them will throw a constraint exception that you can catch, and you can just retry. You could also use a normal Django aggregate instead of raw SQL. Catching constraint exceptions isn't as robust, though.
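The constraint-and-retry workaround might look roughly like this. It is a variant that computes the slot in Python and inserts the row with it directly, so the UNIQUE constraint is what catches a race. The table layout, the helper name `reserve_with_retry`, and the retry limit are assumptions, and sqlite3 stands in for the real database.

```python
import sqlite3

INTERVAL = 1.0 / 3  # seconds per query

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE queries (id INTEGER PRIMARY KEY, site TEXT NOT NULL,"
    " start_time REAL, finished INTEGER NOT NULL DEFAULT 0,"
    " UNIQUE (site, start_time))"
)


def reserve_with_retry(conn, site, now, max_attempts=5):
    """Pick the next free slot for a site, retrying if another request took it."""
    for _ in range(max_attempts):
        (latest,) = conn.execute(
            "SELECT MAX(start_time) FROM queries WHERE site = ?", (site,)
        ).fetchone()
        start_time = now if latest is None else max(latest + INTERVAL, now)
        try:
            cur = conn.execute(
                "INSERT INTO queries (site, start_time) VALUES (?, ?)",
                (site, start_time),
            )
            return cur.lastrowid, start_time
        except sqlite3.IntegrityError:
            continue  # someone else claimed this exact slot; recompute and retry
    raise RuntimeError("could not reserve a slot after %d attempts" % max_attempts)


a = reserve_with_retry(conn, "example.com", now=100.0)
b = reserve_with_retry(conn, "example.com", now=100.0)
```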