Efficient handling of long running HTTP connection

I am working on a web service implemented on top of nginx+gunicorn+django. The clients are smartphone applications. The application needs to make some long running calls to external APIs (Facebook, Amazon S3...), so the server simply queues the job to a job server (using Celery over Redis).

Whenever possible, once the server has queued the job, it returns right away, and the HTTP connection is closed. This works fine and allows the server to sustain very high load.

client                   server                 job server
  .                        |                        |
  .                        |                        |
  |------HTTP request----->|                        |
  |                        |--------queue job------>|
  |<--------close----------|                        |
  .                        |                        |
  .                        |                        |

But in some cases, the client needs to get the result as soon as the job is finished. Unfortunately, there's no way the server can contact the client once the HTTP connection is closed. One solution would be to rely on the client application polling the server every few seconds until the job is completed. I would like to avoid this solution, if possible, mostly because it would hinder the reactiveness of the service, and also because it would load the server with many unnecessary poll requests.

In short, I would like to keep the HTTP connection up and running, doing nothing (except perhaps sending a whitespace every once in a while to keep the TCP connection alive, just like Amazon S3 does), until the job is done, and the server returns the result.

client                   server                 job server
  .                        |                        |
  .                        |                        |
  |------HTTP request----->|                        |
  |                        |--------queue job------>|
  |<------keep-alive-------|                        |
  |         [...]          |                        |
  |<------keep-alive-------|                        |
  |                        |<--------result---------|
  |<----result + close-----|                        |
  .                        |                        |
  .                        |                        |

How can I implement long running HTTP connections in an efficient way, assuming the server is under very high load (it is not the case yet, but the goal to be able to sustain the highest possible load, with hundreds or thousands of requests per second)?

Offloading the actual jobs to other servers should ensure a low CPU usage on the server, but how can I avoid processes piling up and using all the server's RAM, or incoming requests being dropped because of too many open connections?

This is probably mostly a matter of configuring nginx and gunicorn properly. I have read a bit about async workers based on greenlets in gunicorn: the documentation says that async workers are used by "Applications making long blocking calls (Ie, external web services)", this sounds perfect. It also says "In general, an application should be able to make use of these worker classes with no changes". This sounds great. Any feedback on this?

Thanks for your advices.

I'm answering my own question, perhaps someone has a better solution.

Reading gunicorn's documentation a bit further, and reading a bit more about eventlet and gevent, I think that gunicorn answers my question perfectly. Gunicorn has a master process that manages a pool of workers. Each worker can be either synchronous (single threaded, handling one request at a time) or asynchronous (each worker actually handles multiple requests almost simultaneously).

Synchronous workers are very simple to understand and to debug, and if a worker fails, only one request is lost. But if a worker is stuck in a long running external API call, it is basically sleeping. So in case of high load, all workers might end up sleeping while waiting for results, and requests will end up being dropped.

So the solution is to change the default worker type from synchronous to asynchronous (choosing eventlet or gevent, here's a comparison). Now each worker runs multiple green threads, each of which is extremely lightweight. Whenever one thread has to wait for some I/O, another green thread resumes execution. This is called cooperative multitasking. It's very fast, and very lightweight (a single worker could handle thousands of concurrent requests, if they are waiting for I/O). Exactly what I need.

I was wondering how I should change my existing code, but apparently the standard python modules are monkey-patched by gunicorn upon startup (actually by eventlet or gevent) so all existing code can run without change and still behave nicely with other threads.

There are a bunch of parameters which can be tweaked in gunicorn, for example the maximum number of simultaneous clients using gunicorn's worker_connections parameter, the maximum number of pending connections using the backlog parameter, etc.

This is just great, I'll start testing right away!