I am deploying a simple hello world nginx container with marathon, and everything seems to work well, except that I have 6 containers that will not deregister from consul. docker ps
shows none of the containers are running.
I tried using the /v1/catalog/deregister
endpoint to deregister the services, but they keep coming back. I then killed the registrator container, and tried deregistering again. They came back.
I am running registrator with
docker run -d --name agent-registrator -v /var/run/docker.sock:/tmp/docker.sock --net=host gliderlabs/registrator consul://127.0.0.1:8500 -deregister-on-success -cleanup
There is 1 consul agent running.
Restarting the machine (this is a single node installation on a local vm) does not make the services go away.
How do I make these containers go away?
Here is how you can absolutely delete all the zombie services: Go into your consul server, find the location of the json files containing the zombies and delete them.
For example I am running consul in a container:
Notice the flag
-v /mnt:/data
this is where all the data consul is storing is located. For me it was located in/mnt
. Under this directory you will find several other directories.config raft serf services tmp
Go into
services
and you will see the files that contain the json info of your services, find any ones that contains the info of zombies and delete them. Then restart consul. Then repeat for each server in your cluster that has zombies on it.Try to switch to v5
docker run -d --name agent-registrator -v /var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator:v5 -internal consul://172.16.0.4:8500
Using the http api for removing services is another much nicer solution. I just figured out how to manually remove services before I figured out how to use the https api.
To delete a service with the http api use the following command:
curl -v -X PUT http://<consul_ip_address>:8500/v1/agent/service/deregister/<ServiceID>
Note that your is a combination of three things: the IP address of host machine the container is running on, the name of the container, and the inner port of the container (i.e. 80 for apache, 3000 for node js, 8000 for django, ect) all separated by colins
:
Heres an example of what that would actually look like:
curl -v -X PUT http://1.2.3.4:8500/v1/agent/service/deregister/192.168.1.1:sharp_apple:80
If you want an easy way to get the ServiceID then just curl the service that contains a zombie:
curl -s http://<consul_ip_address>:8500/v1/catalog/service/<your_services_name>
Heres a real example for a service called someapp that will return all the services under it:
curl -s http://1.2.3.4:8500/v1/catalog/service/someapp
This is one of the problems with Consul and registrator, if the service doesn't have a check associated with it, the service will stick around until it's de-registered and be "active". So it's good practice to have services register a health check as well. That way they will at least be critical if registrator messes up and forgets to de-register the service (which I see happens a lot). Alex's answer, of erasing the files in consul's data/services directory (then consul reload) definitely works to erase the service, but registrator will re-add them, if the containers are still around and running. Apparently the newer registrator versions are better at cleanup, but I've had mixed success. Now I don't use registrator at all, since it doesn't add health checks. I use nomad to run my containers (also from hashicorp) and it will create the service AND create the health check, and does a great job of cleaning up after itself.
Don't use catalog, instead of using agent, the reason is catalog is maintained by agents, it will be resync-back by agent even if you remove it from catalog, remove zombie services shell script:
json parser jq is used for field retrieving
In a Consul Cluster the Agents are considered authoritative. If you use the the HTTP Api /v1/catalog/deregister endpoint to deregister services, it will keep coming back as long as other Agents have known about that service. It's the way that the Gossip protocol works.
If you want Services to go away immediately you need to deregister the host agent properly by issuing a consul leave before killing the service on the node.