I've done a ton of research on this, and I'm surprised I haven't found a good answer anywhere yet.
I'm running a large application on Heroku, and I have certain Celery tasks that run for a very long time and save a result at the end. Every time I redeploy on Heroku, it sends SIGTERM (and, eventually, SIGKILL) and kills my running worker. I'm trying to find a way for the worker to shut itself down gracefully and re-queue the task for processing later, so that we eventually save the required result instead of losing the queued task.
I cannot find a way to get the worker to listen for SIGTERM properly. The closest I've gotten, which works when running python manage.py celeryd directly but NOT when emulating Heroku with foreman, is the following:
import time

from celery import exceptions

# `app` (the Celery application) and `logger` are defined elsewhere in the project.

@app.task(bind=True, max_retries=1)
def slow(self, x):
    try:
        for x in range(100):
            print 'x: ' + unicode(x)
            time.sleep(10)
    except exceptions.MaxRetriesExceededError:
        logger.error('whoa')
    except (exceptions.WorkerShutdown, exceptions.WorkerTerminate) as exc:
        logger.error(u'retrying, ' + unicode(exc))
        raise self.retry(exc=exc, countdown=10)
    except (KeyboardInterrupt, SystemExit) as exc:
        print 'retrying'
        raise self.retry(exc=exc, countdown=10)
    else:
        return x
    finally:
        logger.info('task ended!')
When I start this celery task under foreman and hit Ctrl+C, the following happens:
^CSIGINT received
22:20:59 system | sending SIGTERM to all processes
22:20:59 web.1 | exited with code 0
22:21:04 system | sending SIGKILL to all processes
Killed: 9
So it's clear that neither the celery exceptions nor the KeyboardInterrupt or SystemExit exceptions I've seen suggested in other posts actually catch SIGTERM and shut the worker down gracefully.
What is the right way to do this?
Celery was unfortunately not designed to do a clean shutdown. EVER. I mean it. Celery workers respond to SIGTERM, but if a task is incomplete, the worker processes will wait for it to finish and only then exit. You can send SIGKILL if the workers don't shut down in a reasonable time, but then you lose information, i.e. you may not know which jobs were left incomplete.
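If the worry is not knowing which jobs were in flight, one thing you can do before sending SIGKILL is ask the workers what they are currently executing via the remote-control inspection API. A minimal sketch (it assumes your Celery application instance is importable as app):

import logging

logger = logging.getLogger(__name__)

def log_active_tasks(app):
    # Ask every worker for its currently executing tasks.
    # inspect().active() returns {worker_name: [task_info, ...]}, or None if no worker replies.
    replies = app.control.inspect().active() or {}
    for worker, tasks in replies.items():
        for task in tasks:
            logger.warning('still running on %s: %s[%s]', worker, task['name'], task['id'])

Anything logged this way is a candidate for re-queueing by hand after the deploy.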
You can use acks_late (per task) or the task_acks_late setting (globally). Tasks are then acknowledged to the broker after they have executed, not just before they start, so a task that was killed mid-run is redelivered and will run again once a worker is available.
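For concreteness, here is a minimal sketch of both options (it assumes Celery 4+ setting names and a made-up broker URL; older versions spell the global setting CELERY_ACKS_LATE):

import time
from celery import Celery

app = Celery('tasks', broker='amqp://localhost')  # hypothetical broker URL

# Global: acknowledge messages only after the task has run.
app.conf.task_acks_late = True

# Or per task. Because the task may be delivered again after a crash,
# it should be safe to run more than once (idempotent).
@app.task(acks_late=True)
def slow(x):
    time.sleep(600)  # stand-in for the long-running work
    return x

With late acks, a message that was being processed when the worker died is never acknowledged, so the broker puts it back on the queue and another worker picks it up.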