I am using Amazon Web Services EC2 Container Service with an Application Load Balancer for my app. When I deploy a new version, I get 503 Service Temporarily Unavailable for about 2 minutes. It is a bit more than the startup time of my application. This means that I cannot do a zero-downtime deployment now.
Is there a setting to not use the new tasks while they are starting up? Or what am I missing here?
UPDATE:
The health check numbers for the target group of the ALB are the following:
Healthy threshold: 5
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 30 seconds
Success codes: 200 OK
Healthy threshold is 'The number of consecutive health checks successes required before considering an unhealthy target healthy'
Unhealthy threshold is 'The number of consecutive health check failures required before considering a target unhealthy.'
Timeout is 'The amount of time, in seconds, during which no response means a failed health check.'
Interval is 'The approximate amount of time between health checks of an individual target'
UPDATE 2: My cluster consists of two EC2 instances, but it can scale up if needed. The desired and minimum count is 2. I run one task per instance because my app needs a specific port number. Before I deploy (Jenkins runs an AWS CLI script), I set the number of instances to 4. Without this, AWS cannot place my new tasks (this is another issue to solve). The networking mode is bridge.
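The pre-deployment step is basically just a desired-capacity bump, something like this (the Auto Scaling group name below is a placeholder):

```bash
# Scale the ECS cluster's Auto Scaling group out before deploying
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-ecs-cluster-asg \
  --desired-capacity 4
```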
You were speaking about Jenkins, so I'll answer with the Jenkins master service in mind, but my answer remains valid for any other case (even if it's not a great example for ECS, since a Jenkins master doesn't scale horizontally, there can be only one instance of it).
503 Service Unavailable
I often encountered 503 errors caused by the load balancer having no healthy instance to route to (failing health checks). Have a look at your load balancer's monitoring tab to make sure that the healthy host count is always above 0.
If you're doing an HTTP health check, it must return a 200 code (the list of valid codes is configurable in the load balancer settings) only when your server is really up and running. Otherwise the load balancer could start sending traffic to instances that are not fully running yet.
If the issue is that you always get a 503, it may be because your instances take too long to respond while the service is initializing, so ECS considers them unhealthy and kills them before their initialization is complete. That's often the case on a Jenkins master's first run.
To avoid that last problem, you can consider adapting your load balancer health check settings (the ping target for a classic load balancer, the target group health check for an application load balancer):
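For an application load balancer, for example, these settings live on the target group and can be relaxed with the CLI (the ARN, path and values below are only placeholders):

```bash
# Relax the target group health check so a slow-starting service
# is not marked unhealthy before it finishes initializing
aws elbv2 modify-target-group \
  --target-group-arn <your-target-group-arn> \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5
```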
One node per instance
If you need to be sure you have only one task per instance, you may use a classic load balancer (it also behaves well with ECS). With a classic load balancer, only one task of the service can run per EC2 instance, since the host ports are fixed. That's also the only way to make non-HTTP ports accessible (for instance, Jenkins needs port 80, but also port 50000 for the slaves).
However, as the ports are not dynamic with a classic load balancer, you have to do some port mapping, for example:
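For a Jenkins master, the port mapping part of the container definition could look something like this (the container port depends on your image; these values are only illustrative):

```json
"portMappings": [
  { "hostPort": 80,    "containerPort": 8080,  "protocol": "tcp" },
  { "hostPort": 50000, "containerPort": 50000, "protocol": "tcp" }
]
```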
And of course you need one load balancer per service.
Jenkins healthcheck
If that's really a Jenkins service that you want to launch, you should use the Jenkins Metrics plugin to obtain a good healthcheck URL.
Install it, then in the global options generate a token and activate the ping, and you should be able to reach a URL looking like this: http://myjenkins.domain.com/metrics/mytoken12b3ad1/ping
This URL answers with HTTP code 200 only when the server is fully running, which is important so the load balancer only puts the instance in service once it's completely ready.
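You can check the behaviour yourself, for example with curl (it should print 200 once Jenkins is fully up, and typically 503 while it is still starting):

```bash
curl -s -o /dev/null -w "%{http_code}\n" \
  http://myjenkins.domain.com/metrics/mytoken12b3ad1/ping
```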
Logs
Finally, if you want to know what is happening to your instance and why it is failing, you can send the container's logs to AWS CloudWatch.
Just add this in the task definition (container conf):
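With the awslogs driver it looks roughly like this (log group, region and prefix are placeholders; the log group has to exist already):

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "my-app-logs",
    "awslogs-region": "eu-west-1",
    "awslogs-stream-prefix": "my-app"
  }
}
```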
It will give you more insight into what is happening during container initialization, whether it just takes too long or whether it is failing.
Hope it helps!!!
With your settings, your application's startup has to take more than 30 seconds for it to fail 2 consecutive health checks and be marked unhealthy (assuming the first check happens right after your app goes down). It will then take between 2 minutes (first check right after your app comes back online, best case) and 2.5 minutes (first check just before your app comes back up, worst case) to be marked healthy again.
So, a quick and dirty fix is to increase the Unhealthy threshold so that targets aren't marked unhealthy during updates, and maybe decrease the Healthy threshold so that they are marked healthy again more quickly.
But if you really want to achieve zero downtime, then you should use multiple instances of your app and tell AWS to stage deployments as suggested by Manish Joshi (so that there are always enough healthy instances behind your ELB to keep your site operational).
So, the issue turned out to lie in the port mappings of my container settings in the task definition. Before, I was using 80 as the host port and 8080 as the container port. I thought I had to use these, but the host port can actually be any value. If you set it to 0, ECS assigns a port in the range 32768-61000, which makes it possible to place multiple tasks on one instance. For this to work, I also needed to change my security group to allow traffic from the ALB to the instances on these ports.
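In the container definition that boils down to something like this (8080 is just my app's container port, shown for illustration):

```json
"portMappings": [
  { "hostPort": 0, "containerPort": 8080, "protocol": "tcp" }
]
```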
When ECS can run multiple tasks on the same instance, the 50/200 minimum/maximum healthy percent setting makes sense, and it becomes possible to deploy a new task revision without adding new instances. This also ensures zero-downtime deployment.
Thank you for everybody who asked or commented!
How I solved this was to have a flat file in the application root that the ALB monitors to keep the node healthy. Before deployment, a script removes this file and keeps monitoring the node until it registers OutOfService. That way all live connections have stopped and drained. At this point, the deployment is started by stopping the node or application process. After deployment, the node is added back to the LB by adding the flat file back, and it is monitored until it registers InService before moving on to the second node to repeat the same steps. My script looks as follows:
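In outline it does the following (a sketch of that flow with placeholder names, assuming a classic ELB for the InService/OutOfService states, not the exact script):

```bash
#!/bin/bash
# Drain one node, deploy, then bring it back into the load balancer.
# Load balancer name, instance id, host and health-file path are placeholders.
LB_NAME="my-classic-elb"
INSTANCE_ID="i-0123456789abcdef0"
HOST="node1.example.com"
HEALTH_FILE="/var/www/app/healthcheck.html"

state() {
  aws elb describe-instance-health \
    --load-balancer-name "$LB_NAME" \
    --instances "$INSTANCE_ID" \
    --query 'InstanceStates[0].State' \
    --output text
}

# 1. Remove the flat file so the health check starts failing for this node
ssh "$HOST" "rm -f $HEALTH_FILE"

# 2. Wait until the LB reports OutOfService (live connections have drained)
until [ "$(state)" = "OutOfService" ]; do sleep 10; done

# 3. Deploy while the node receives no traffic
ssh "$HOST" "sudo systemctl restart my-app"   # placeholder for the actual deployment step

# 4. Put the flat file back and wait for InService before moving to the next node
ssh "$HOST" "touch $HEALTH_FILE"
until [ "$(state)" = "InService" ]; do sleep 10; done
```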
Since you are using AWS ECS, may I ask what the service's "minimum healthy percent" and "maximum percent" are?
Make sure that you have a "maximum percent" of 200 and a "minimum healthy percent" of 50 so that during deployment not all of your tasks go down.
In practice, these two settings mean the following:
A "minimum healthy percent" of 50 ensures that only half of the service's containers are killed before the new version is deployed, i.e. if the service's desired task count is 2, then during deployment only 1 container with the old version is killed first; once the new version is running, the second old container is killed and a second new-version container is started. This makes sure that at any given time there are tasks handling requests.
Similarly, a "maximum percent" of 200 tells ECS that during a deployment the number of the service's containers can temporarily grow up to double the desired task count.
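These values can be set on the service, for example with the CLI (cluster and service names below are placeholders):

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=50
```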
Please let me know in case of any further questions.