We run Docker-based ECS services in which each process, once started, must synchronize application state before it is ready to serve requests. This takes some time (a number of seconds after the process starts).
When using ECS Services, changing the task definition version triggers a rolling replacement of the tasks (good), but the replacement happens too quickly. As soon as a new task reaches the RUNNING state, the next old task is killed. But RUNNING only means the process has started; it doesn't mean the task has met all its own internal requirements to be ready to do work, which in this case means being ready to serve requests.
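To make the gap between "started" and "ready" concrete, here is a minimal Python sketch. It is not our actual service: `Service`, `sync_seconds`, `is_ready`, and `wait_until_ready` are hypothetical names, and the sleep merely stands in for the state synchronization. It only illustrates that a process can be up for a while before it is able to do work.

```python
import threading
import time


class Service:
    """Hypothetical stand-in for a task that must sync state before serving."""

    def __init__(self, sync_seconds):
        self.sync_seconds = sync_seconds  # assumed duration of the state sync
        self.started = False              # roughly what RUNNING reflects
        self._ready = threading.Event()   # set only once the sync has finished

    def start(self):
        # The process is "up" immediately; the sync continues in the background.
        self.started = True
        threading.Thread(target=self._sync_state, daemon=True).start()

    def _sync_state(self):
        time.sleep(self.sync_seconds)     # placeholder for loading application state
        self._ready.set()

    def is_ready(self):
        # What a readiness or health endpoint would report.
        return self._ready.is_set()

    def wait_until_ready(self, timeout):
        # Block until ready or until the timeout expires; returns True if ready.
        return self._ready.wait(timeout)


if __name__ == "__main__":
    svc = Service(sync_seconds=0.5)
    svc.start()
    print("started:", svc.started, "ready:", svc.is_ready())
    svc.wait_until_ready(timeout=5.0)
    print("started:", svc.started, "ready:", svc.is_ready())
```

What I am after is a way for the rolling update to wait on something like `is_ready()` rather than on the equivalent of `started`.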
This entire update process happens so quickly that in some cases, all the old tasks are killed before any of the new tasks have finished loading their state, and we end up with an outage.
What is the best or correct way to ensure ECS Services doesn't terminate old (hot) tasks until the new tasks are actually hot and fully online, rather than as soon as their container process is running?