After two days of debugging, I tracked down my time hog: the Python garbage collector.
My application holds a lot of objects in memory, and it works well.
The GC does the usual rounds (I have not played with the default thresholds of (700, 10, 10)).
Once in a while, in the middle of an important transaction, the 2nd generation sweep kicks in and reviews my ~1.5M generation 2 objects.
This takes 2 seconds!
The nominal transaction takes less than 0.1 seconds.
My question is what should I do?
I can turn off generation 2 sweeps (by setting a very high threshold - is this the right way?) and the GC is obedient.
When should I turn them on?
We implemented a web service using Django, and each user request takes about 0.1 seconds.
Optimally, I will run these GC gen 2 cycles between user API requests. But how do I do that?
My view ends with return HttpResponse(), AFTER which I would like to run a gen 2 GC sweep.
How do I do that? Does this approach even make sense?
Can I mark the objects that NEVER need to be garbage collected, so that the GC will not examine them on every gen 2 cycle?
How can I configure the GC to run full sweeps when the Django server is relatively idle?
Python 2.6.6 on multiple platforms (Windows / Linux).
We did something like this for gunicorn. Depending on which WSGI server you use, you need to find the right hook that runs AFTER the response, not before it. Django has a request_finished signal, but that signal still fires before the response is sent.
For gunicorn, you need to define two functions in the config file, like so:
import gc

def pre_request(worker, req):
    # disable gc until end of request
    gc.disable()

def post_request(worker, req, environ, resp):
    # enable gc after a request
    gc.enable()
The post_request hook here runs after the HTTP response has been delivered, so it is a very good time for garbage collection.
I believe one option would be to completely disable garbage collection and then manually collect at the end of a request as suggested here: How does the Garbage Collection mechanism work?
I imagine that you could disable the GC in your settings.py file.
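As a rough illustration of that idea, the sketch below switches the automatic collector off and later reclaims cyclic garbage with a single explicit gc.collect() call. `Node` and `handle_request` are made-up stand-ins for real request work; in a Django project the gc.disable() call would sit somewhere that runs once at startup, such as settings.py.

```python
import gc

class Node(object):
    """Hypothetical object that references itself, so only the cycle collector can free it."""
    def __init__(self):
        self.me = self

def handle_request():
    # Stand-in for real work: leaves 1000 unreachable reference cycles behind.
    for _ in range(1000):
        Node()

gc.disable()          # once at startup: no automatic sweeps from here on
gc.collect()          # start from a clean slate
handle_request()      # runs without any GC pause
freed = gc.collect()  # explicit full sweep at a moment you choose
print("unreachable objects found: %d" % freed)
```

Reference counting keeps working while the collector is disabled, so only objects caught in reference cycles pile up between explicit collections.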
If you want to run garbage collection on every request, I would suggest writing a middleware that does it in its process_response method:
import gc

class GCMiddleware(object):
    def process_response(self, request, response):
        # run a full collection before handing the response back
        gc.collect()
        return response
An alternative might be to disable GC altogether, and configure mod_wsgi (or whatever you're using) to kill and restart processes more frequently.
My view ends with return HttpResponse(), AFTER which I would like to run a gen 2 GC sweep.
gc.disable()           # turn off GC
# do stuff
resp = HttpResponse()
gc.enable()            # turn on GC
return resp
I'm not sure, but instead of the "turn on GC" step you might be able to spawn a thread that turns the GC back on after 0.1 sec.
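One way to sketch the thread idea uses threading.Timer from the standard library. The 0.1 second delay is the figure mentioned above, not a measured value, and `schedule_gc` is a hypothetical helper name. Keep in mind that in standard CPython the collector runs while holding the GIL, so a sweep started from a timer thread can still stall a request that arrives in the meantime.

```python
import gc
import threading

def _enable_and_collect():
    # Runs on the timer thread once the delay has elapsed.
    gc.enable()
    gc.collect()

def schedule_gc(delay=0.1):
    """Re-enable GC and run a full sweep `delay` seconds from now, on a separate thread."""
    timer = threading.Timer(delay, _enable_and_collect)
    timer.daemon = True  # do not keep the process alive just for this timer
    timer.start()
    return timer
```

In the view, this would replace the step that turns the GC back on: disable the GC, build the response, call `schedule_gc()`, then return the response.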
To make sure that the GC doesn't run until after the request has been processed, if spawning a thread doesn't work, you would need to modify Django itself or use some sort of Django hook, as dcurtis suggested.
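If neither Django nor the server offers a suitable hook, one server-independent sketch relies on the WSGI rule (PEP 3333) that the server must call close() on the response iterable once the request is complete. The wrapper below is an illustration under that assumption, not part of Django: `GCAfterResponse` and `_ClosingIterator` are hypothetical names, and whether close() runs after the bytes have actually reached the client depends on the server.

```python
import gc

class _ClosingIterator(object):
    """Wraps a WSGI response iterable and runs a callback when the server closes it."""
    def __init__(self, iterable, on_close):
        self._iterable = iterable
        self._on_close = on_close

    def __iter__(self):
        return iter(self._iterable)

    def close(self):
        try:
            # Honour the wrapped iterable's own close(), as WSGI requires.
            if hasattr(self._iterable, "close"):
                self._iterable.close()
        finally:
            self._on_close()

class GCAfterResponse(object):
    """WSGI middleware: no automatic GC during the request, full sweep on close()."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        gc.disable()
        try:
            body = self.app(environ, start_response)
        except Exception:
            gc.enable()  # do not leave the collector off if the app fails
            raise
        return _ClosingIterator(body, self._finish)

    @staticmethod
    def _finish():
        gc.enable()
        gc.collect()
```

It would be applied by wrapping the WSGI application object once at startup, for example `application = GCAfterResponse(application)` in the module your server loads.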
If you're dealing with performance-critical code, you might also want to consider writing that part in a language with manual memory management, such as C or C++, and using Python simply to invoke or query it.