I'm using Celery from a webapp to start a task hierarchy.
Tasks
I'm using the following tasks:
task_a
task_b
task_c
notify_user
A Django view starts several task_a
instances. Each of them does some processing and then starts several task_b
instances. And each of those does some processing and then starts several task_c
instances.
To visualize:
Goals
My goal is to execute all the tasks, and to run a callback function as soon as the entire hierarchy has finished. Additionally, I want to be able to pass data from the lowest tasks to the top level.
- The view should just "kick off" the tasks and then return.
- Each subtask depends on the parent task. The parent task does not depend directly on the child task. After a parent task has started all the child tasks, it can be stopped.
- Everything can be parallelized, as long as the parent task runs before the child task is started.
- After all the tasks have finished, the
notify_user
callback function should be called. - The
notify_user
callback function needs access to data from thetask_c
s.
All the tasks should be non-blocking, so task_b
should not wait for all the task_c
subtasks to finish.
What would be the right way to achieve the goal stated above?
The solution turned out to be the dynamic task feature provided in this pull request: https://github.com/celery/celery/pull/817. With this, each task can return a group of subtasks, which will then replace the original taks in the queue.
Suppose you have these tasks:
For a given input data (as you drawn it):
You can do:
It's representation is (don't read it, it's ugly :)
And the data passed into notify_user would be:
Everything is run via callbacks (chords) so there are no tasks waiting for other tasks.