AWS ECS 503 Service Temporarily Unavailable while

2019-03-25 01:14发布

问题:

I am using Amazon Web Services EC2 Container Service with an Application Load Balancer for my app. When I deploy a new version, I get 503 Service Temporarily Unavailable for about 2 minutes. It is a bit more than the startup time of my application. This means that I cannot do a zero-downtime deployment now.

Is there a setting to not use the new tasks while they are starting up? Or what am I missing here?

UPDATE:

The health check numbers for the target group of the ALB are the following:

Healthy threshold:     5
Unhealthy threshold:   2
Timeout:               5 seconds
Interval:              30 seconds
Success codes:         200 OK

Healthy threshold is 'The number of consecutive health checks successes required before considering an unhealthy target healthy'
Unhealthy threshold is 'The number of consecutive health check failures required before considering a target unhealthy.'
Timeout is 'The amount of time, in seconds, during which no response means a failed health check.'
Interval is 'The approximate amount of time between health checks of an individual target'

UPDATE 2: So, my cluster consists of two EC2 instances, but can scale up if needed. The desired and minimum count is 2. I run one task per instance, because my app needs a specific port number. Before I deploy (jenkins runs an aws cli script) I set the number of instances to 4. Without this, AWS cannot deploy my new tasks (this is another issue to solve). Networking mode is bridge.

回答1:

So, the issue seems to lie in the port mappings of my container settings in the task definition. Before I was using 80 as host and 8080 as container port. I thought I need to use these, but the host port can be any value actually. If you set it to 0 then ECS will assign a port in the range of 32768-61000 and thus it is possible to add multiple tasks to one instance. In order for this to work, I also needed to change my security group letting traffic come from the ALB to the instances on these ports.
So, when ECS can run multiple tasks on the same instance, the 50/200 min/max healthy percent makes sense and it is possible to do a deploy of new task revision without the need of adding new instances. This also ensures the zero-downtime deployment.

Thank you for everybody who asked or commented!



回答2:

Since you are using AWS ECS may I ask what is the service's "minimum health percent" and "maximum health percent"

Make sure that you have "maximum health percent" of 200 and "minimum health percent" of 50 so that during deployment not all of your services go down.

Please find the documentation definition of these two terms:

Maximum percent provides an upper limit on the number of running tasks during a deployment enabling you to define the deployment batch size.

Minimum healthy percent provides a lower limit on the number of running tasks during a deployment enabling you to deploy without using additional cluster capacity.

A limit of 50 for "minimum health percent" will make sure that only half of your services container gets killed before deploying the new version of the container, i.e. if the desired task value of the service is "2" than at the time of deployment only "1" container with old version will get killed first and once the new version is deployed the second old container will get killed and a new version container deployed. This will make sure that at any given time there are services handling the request.

Similarly a limit of 200 for "maximum health percent" tells the ecs-agent that at a given time during deployment the service's container can shoot up to a maximum of double of the desired task.

Please let me know in case of any further question.



回答3:

With your settings, you application start up should take more then 30 seconds in order to fail 2 health checks and be marked unhealthy (assuming first check immediately after your app went down). And it will take at least 2 minutes and up to 3 minutes then to be marked healthy again (first check immediately after your app came back online in the best case scenario or first check immediately before your app came back up in the worst case).

So, a quick and dirty fix is to increase Unhealthy threshold so that it won't be marked unhealthy during updates. And may be decrease Healthy threshold so that it is marked healthy again quicker.

But if you really want to achieve zero downtime, then you should use multiple instances of your app and tell AWS to stage deployments as suggested by Manish Joshi (so that there are always enough healthy instances behind your ELB to keep your site operational).



回答4:

How i solved this was to have a flat file in the application root that the ALB would monitor to remain healthy. Before deployment, a script will remove this file while monitoring the node until it registers OutOfService.

That way all live connection would have stopped and drained. At this point, the deployment is then started by stopping the node or application process. After deployment, the node is added back to the LB by adding back this flat file and monitored until it registers Inservice for this node before moving to the second node to complete same step above.

My script looks as follow

# Remove Health Check target
echo -e "\nDisabling the ELB Health Check target and waiting for OutOfService\n"
rm -f /home/$USER/$MYAPP/server/public/alive.html

# Loop until the Instance is Out Of Service
while true
do
        RESULT=$(aws elb describe-instance-health --load-balancer-name $ELB --region $REGION --instances $AMAZONID)
        if echo $RESULT | grep -qi OutOfService ; then
                echo "Instance is Deattached"
                break
        fi
        echo -n ". "
        sleep $INTERVAL
done


回答5:

You were speaking about Jenkins, so I'll answer with the Jenkins master service in mind, but my answer remains valid for any other case (even if it's not a good example for ECS, a Jenkins master doesn't scale correctly, so there can be only one instance).

503 bad gateway

I often encountered 503 gateway errors related to load balancer failing healthchecks (no healthy instance). Have a look at your load balancer monitoring tab to ensure that the count of healthy hosts is always above 0.

If you're doing an HTTP healthcheck, it must return a code 200 (the list of valid codes is configurable in the load balancer settings) only when your server is really up and running. Otherwise the load balancer could put at disposal instances that are not fully running yet.

If the issue is that you always get a 503 bad gateway, it may be because your instances take too long to answer (while the service is initializing), so ECS consider them as down and close them before their initialization is complete. That's often the case on Jenkins first run.

To avoid that last problem, you can consider adapting your load balancer ping target (healthcheck target for a classic load balancer, listener for an application load balancer):

  • With an application load balancer, try with something that will always return 200 (for Jenkins it may be a public file like /robots.txt for example).
  • With a classic load balancer, use a TCP port test rather than a HTTP test. It will always succeed if you have opened the port correctly.

One node per instance

If you need to be sure you have only one node per instance, you may use a classic load balancer (it also behaves well with ECS). With classic load balancers, ECS ensures that only one instance runs per server. That's also the only solution to have non HTTP ports accessible (for instance Jenkins needs 80, but also 50000 for the slaves).

However, as the ports are not dynamic with a classic load balancer, you have to do some port mapping, for example:

myloadbalancer.mydomain.com:80 (port 80 of the load balancer) -> instance:8081 (external port of your container) -> service:80 (internal port of your container).

And of course you need one load balancer per service.

Jenkins healthcheck

If that's really a Jenkins service that you want to launch, you should use the Jenkins Metrics plugin to obtain a good healthcheck URL.

Install it, and in the global options, generate a token and activate the ping, and you should be able to reach an URL looking like this: http://myjenkins.domain.com/metrics/mytoken12b3ad1/ping

This URL will answer the HTTP code 200 only when the server is fully running, which is important for the load balancer to activate it only when it's completely ready.

Logs

Finally, if you want to know what is happening to your instance and why it is failing, you can add logs to see what the container is saying in AWS Cloudwatch.

Just add this in the task definition (container conf):

Log configuration: awslogs
awslogs-group: mycompany (the Cloudwatch key that will regroup your container logs)
awslogs-region: us-east-1 (your cluster region)
awslogs-stream-prefix: myservice (a prefix to create the log name)

It will give you more insight about what is happening during a container initialization, if it just takes too long or if it is failing.

Hope it helps!!!