So my objective here is to set up a cluster of several kafka-brokers in a distributed fashion. But I can't see the way to make the brokers aware of each other.
As far as i understand, every broker needs a separate ID in their config, which I cannot guarantee or configure if I launch the containers from kubernetes?
They also need to have the same advertised_host?
Are there any parameters I'm missing that would need to be changed for the nodes to discover each other?
Would it be viable to do such a configuration at the end of the Dockerfile with a script? And/or a shared volume?
I'm currently trying to do this with the spotify/kafka-image which has a preconfigured zookeeper+kafka combination, on vanilla Kubernetes.
My solution for this has been to use the IP as the ID: trim the dots and you get a unique ID that is also available outside of the container to other containers.
With a Service you can get access to the multiple containers's IPs (see my answer here on how to do this: what's the best way to let kubenetes pods communicate with each other?
so you can get their IDs too if you use IPs as the unique ID. The only issue is that IDs are not continuous or start at 0, but zookeeper / kafka don't seem to mind.
EDIT 1:
The follow up concerns configuring Zookeeper:
Each ZK node needs to know of the other nodes. The Kubernetes discovery service knowns of nodes that are within a Service so the idea is to start a Service with the ZK nodes.
This Service needs to be started BEFORE creating the ReplicationController (RC) of the Zookeeper pods.
The start-up script of the ZK container will then need to:
KUBERNETES_SERVICE_HOST
environment variable is available in each container. The endpoint to find service description is thenURL="http(s)://$USERNAME:$PASSWORD@${KUBERNETES_SERVICE_HOST/api/v1/namespaces/${NAMESPACE}/endpoints/${SERVICE_NAME}"
where
NAMESPACE
isdefault
unless you changed it, andSERVICE_NAME
would be zookeeper if you named your service zookeeper.there you get the description of the containers forming the Service, with their ip in a "ip" field. You can do:
to get the list of IPs in the Service. With that, populate the zoo.cfg on the node using the ID defined above
You might need the USERNAME and PASSWORD to reach the endpoint on services like google container engine. These need to be put in a Secret volume (see doc here: http://kubernetes.io/v1.0/docs/user-guide/secrets.html )
You would also need to use
curl -s --insecure
on Google Container Engine unless you go through the trouble of adding the CA cert to your podsBasically add the volume to the container, and look up the values from file. (contrary to what the doc says, DO NOT put the \n at the end of the username or password when base64 encoding: it just make your life more complicated when reading those)
EDIT 2:
Another thing you'll need to do on the Kafka nodes is get the IP and hostnames, and put them in the /etc/hosts file. Kafka seems to need to know the nodes by hostnames, and these are not set within service nodes by default
EDIT 3:
After much trial and thoughts using IP as an ID may not be so great: it depends on how you configure storage. for any kind of distributed service like zookeeper, kafka, mongo, hdfs, you might want to use the emptyDir type of storage, so it is just on that node (mounting a remote storage kind of defeats the purpose of distributing these services!) emptyDir will relaod with the data on the same node, so it seems more logical to use the NODE ID (node IP) as the ID, because then a pod that restarts on the same node will have the data. That avoid potential corruption of the data (if a new node starts writing in the same dir that is not actually empty, who knows what can happen) and also with Kafka, the topics being assigned a broker.id, if the broker id changes, zookeeper does not update the topic broker.id and the topic looks like it is available BUT points to the wrong broker.id and it's a mess.
So far I have yet to find how to get the node IP though, but I think it's possible to lookup in the API by looking up the service pods names and then the node they are deployed on.
EDIT 4
To get the node IP, you can get the pod hostname == name from the endpoints API /api/v1/namespaces/default/endpoints/ as explained above. then you can get the node IP from the pod name with /api/v1/namespaces/default/pods/
PS: this is inspired by the example in the Kubernetes repo (example for rethinkdb here: https://github.com/kubernetes/kubernetes/tree/master/examples/rethinkdb
I did this using docker-compose (The difference for Kubernetes would be that you would pass the ID via your service.yaml and have 2 services):
Config:
sh:
This shows up prominently in my searches but contains pretty outdated information. To update this with a more modern solution, you should use a StatefulSet deployment, which will generate pods that have an integer counter instead of a hash in their name, eg. kafka-controller-0.
This is of course the hostname, so from there it's a simple matter to extract a fixed, invariant broker ID using awk:
The most popular containers available for Kafka these days have a broker ID command.
Look at https://github.com/CloudTrackInc/kubernetes-kafka It allows to start Kafka in kubernetes and support scaling it and auto extanding.