Should swarm loadbalancing perform healthchecks on

Published 2019-04-12 13:46


The Load Balancing section in the Swarm docs doesn't make it clear whether the internal load balancer also performs health checks, and whether it removes nodes that are no longer running the service (because the container was killed or the node was rebooted).

In the following case I've got a service with 3 replicas, one instance running on each of the 3 nodes.


[root@centosvm ~]# docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES
a593d485050a        ddewaele/springboot.crud.sample:latest   "sh -c 'java $JAVA_OP"   7 minutes ago       Up 7 minutes                            springbootcrudsample.1.5syc6j4c8i3bnerdqq4e1yelm


[root@node1 ~]# docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES
d3b3fbc0f2c5        ddewaele/springboot.crud.sample:latest   "sh -c 'java $JAVA_OP"   4 minutes ago       Up 4 minutes                            springbootcrudsample.3.7y1oyjyrifgkmxlr20oai5ppl

Node 2:

[root@node2 ~]# docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES
ebca8f24ec3a        ddewaele/springboot.crud.sample:latest   "sh -c 'java $JAVA_OP"   7 minutes ago       Up 7 minutes                            springbootcrudsample.2.4tqjad7od8ep047s55485na1t

Now, on node1, we kill the Docker container. The node is briefly without a service instance (Swarm re-creates it here after a couple of seconds to keep the service at 3 replicas).

[root@node1 ~]# docker kill d3b3fbc0f2c5

Container gone

[root@node1 ~]# docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES

New container up

[root@node1 ~]# docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES
b8c9a7a5cf97        ddewaele/springboot.crud.sample:latest   "sh -c 'java $JAVA_OP"   11 seconds ago      Up 9 seconds                            springbootcrudsample.3.9v4cnhi8dvq7n8afb2kvp28sk

In the output below, however, when container d3b3fbc0f2c5 was killed, the ingress load balancer didn't detect this and kept sending traffic to the node (resulting in connection refused).

How should we handle such a scenario? Do we still need an external load balancer for it, and how should we configure it?

[root@centosvm ~]# while :; do curl http://localhost:8080/env/hostname ; echo "" ; sleep 1; done
curl: (7) Failed connect to localhost:8080; Connection refused

curl: (7) Failed connect to localhost:8080; Connection refused

curl: (7) Failed connect to localhost:8080; Connection refused

curl: (7) Failed connect to localhost:8080; Connection refused

curl: (7) Failed connect to localhost:8080; Connection refused

curl: (7) Failed connect to localhost:8080; Connection refused



As indicated by François Maturel, with a proper healthcheck in place, Docker Swarm will take into account the health status of the container to decide if it will route requests to it.

For Spring Boot applications that have the default actuators enabled, adding the following line to the Dockerfile is sufficient for a basic healthcheck. Once the Spring Boot app is initialized and its health actuator is enabled, the following HTTP request returns a valid HTTP 200 response and the healthcheck passes.

HEALTHCHECK CMD wget -q http://localhost:8080/health -O /dev/null
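
The Dockerfile line bakes the check into the image. As a rough sketch of an alternative, the same check can be attached to an already running service with the `--health-*` flags of `docker service update`. The wrapper name `add_healthcheck` is hypothetical, and the interval, timeout, retry and start-period values are illustrative assumptions rather than recommendations from the original answer.

```shell
# Hypothetical helper: attach the wget-based healthcheck to an existing
# Swarm service without rebuilding the image.
# The timing values below are assumptions; tune them for your application.
add_healthcheck() {
  service=$1
  docker service update \
    --health-cmd 'wget -q http://localhost:8080/health -O /dev/null' \
    --health-interval 10s \
    --health-timeout 3s \
    --health-retries 3 \
    --health-start-period 30s \
    "$service"
}
```

For the service in the question this would be called as `add_healthcheck springbootcrudsample`. Updating the healthcheck changes the service spec, so Swarm will typically roll the tasks to apply it.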

With a healthcheck in place, your Docker containers are able to reach a healthy status. When a container is started, the service running within it might still be initializing. To do proper load balancing and detect service health, Swarm needs to know when it can route requests to a particular service instance (container on a node).

So when Swarm starts a service replica, it fires up a container and waits until its health status is "healthy". While your container is starting, its status transitions from "starting":

CONTAINER ID        IMAGE                                                                                                     COMMAND                  CREATED             STATUS                                     PORTS               NAMES
5001e1c46953        ddewaele/springboot.crud.sample@sha256:4ce69c3f50c69640c8240f9df68c8816605c6214b74e6581be44ce153c0f3b7a   "/docker-entrypoin..."   5 seconds ago       Up Less than a second (health: starting)                       springbootcrudsample.2.yt6d38zhhq2wxt1d6qfjz5974

to "healthy". Only then will the Swarm load balancer route requests to this endpoint.

[root@centos-a ~]# docker ps
CONTAINER ID        IMAGE                                                                                                     COMMAND                  CREATED              STATUS                        PORTS               NAMES
5001e1c46953        ddewaele/springboot.crud.sample@sha256:4ce69c3f50c69640c8240f9df68c8816605c6214b74e6581be44ce153c0f3b7a   "/docker-entrypoin..."   About a minute ago   Up About a minute (healthy)                       springbootcrudsample.2.yt6d38zhhq2wxt1d6qfjz5974
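
The "starting" to "healthy" transition shown above can also be polled from a script, for example before pointing test traffic at a fresh container. This is a minimal sketch: `container_health` and `wait_until_healthy` are hypothetical helper names, and the default of 30 polls with a 2 second delay is an assumption. It relies on `docker inspect` exposing the status under `.State.Health.Status`, which is only populated when the container has a healthcheck.

```shell
# Print the health status Docker records for a container:
# "starting", "healthy" or "unhealthy".
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Hypothetical helper: poll until the container reports "healthy".
# Returns 0 when healthy, 1 when max_tries polls pass without success,
# and 2 when docker inspect itself fails (for example, no healthcheck set).
wait_until_healthy() {
  container=$1
  max_tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    status=$(container_health "$container") || return 2
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

With the container from the listing above, `wait_until_healthy 5001e1c46953 && echo ready` would print `ready` once the healthcheck has passed.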


@ddewaele is correct, so here are some more tidbits:

  • No, the LB does not perform port connection checks directly. That's the job of the Docker engine, which kicks off the healthchecks; a healthcheck could be a simple curl or much more.
  • Healthchecks are critical to zero-downtime deployments, especially if your container takes more than a fraction of a second to start up or shut down. Without a healthcheck, Docker only knows "Does Linux say the process is running?"
  • You can use docker events to see it kicking off exec commands in each container that has a healthcheck set for its Swarm service. You can also see there how it marks the task/container as healthy/unhealthy.
  • There have been issues/bugs with the ingress load balancer sending packets during update/shutdown of tasks, but AFAIK as of 17.12 (just released) those are mostly/all fixed. One of the old issues is that the LB might not remove the task from its routing table before the container shutdown starts, but people are reporting better results with the last few releases.
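
To watch the healthy/unhealthy transitions mentioned in the list above, the event stream can be narrowed down with filters. A small sketch, assuming a Docker version where the `event=health_status` filter matches the `health_status: healthy` and `health_status: unhealthy` events; `watch_health_events` is a hypothetical wrapper name.

```shell
# Hypothetical helper: stream only container health transitions.
# Runs until interrupted (Ctrl-C), like plain `docker events`.
watch_health_events() {
  docker events --filter type=container --filter event=health_status
}
```

Run `watch_health_events` on a node in one terminal and kill or restart a task on that node in another; each change in health status should then appear as its own event line.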