I would like to use the embedded ZooKeeper 3.4.9 that comes with Kafka 0.10.2, and not install ZooKeeper separately. Each Kafka broker will always have a 1:1 ZooKeeper on localhost.
So if I have 5 brokers on hosts A, B, C, D and E, each with a single Kafka and ZooKeeper instance running on them, is it sufficient to just run the ZooKeeper provided with Kafka?
What downsides or configuration limitations, if any, does the embedded ZooKeeper 3.4.9 have compared to the standalone version?
Here are a few reasons not to run ZooKeeper on the same box as Kafka brokers.
They scale differently
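A 1:1 pairing works with 5 ZooKeeper nodes and 5 Kafka brokers, but not at 6:6 or 11:11. An ensemble needs a majority of its nodes to stay available, so an even-sized ensemble of 6 tolerates no more failures than one of 5. You don't need more than 5 ZooKeeper nodes even for a quite large Kafka cluster. Unlike Kafka, ZooKeeper replicates data to all nodes, so writes get slower as you add more nodes.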
They compete for disk I/O
ZooKeeper is very sensitive to disk I/O latency. Keep it on a separate physical disk from the Kafka commit log; otherwise heavy publishing to Kafka can slow ZooKeeper down enough that it drops out of the ensemble, which can cause problems for the cluster.
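As a rough sketch of that disk separation, the fragment below points ZooKeeper and Kafka at different mount points. The paths /mnt/zk-disk and /mnt/kafka-disk are hypothetical and assume two separate physical disks are already mounted there. The keys themselves (dataDir and dataLogDir for ZooKeeper, log.dirs for the broker) are standard settings.

```properties
# --- config/zookeeper.properties ---
# Snapshots go here. Hypothetical mount point on a disk Kafka does not use.
dataDir=/mnt/zk-disk/zookeeper
# Optional: directory for the ZooKeeper transaction log (hypothetical path).
dataLogDir=/mnt/zk-disk/zookeeper-txlog

# --- config/server.properties ---
# Kafka commit log on its own disk (hypothetical mount point).
log.dirs=/mnt/kafka-disk/kafka-logs
```

The two sections belong in two different files; they are shown together only to make the contrast visible.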
They compete for page cache memory
Kafka uses the Linux OS page cache to reduce disk I/O. When other apps run on the same box as Kafka, they "pollute" the page cache with their own data, which leaves less cache for Kafka.
Server failures take down more infrastructure
If the box reboots you lose both a zookeeper and a broker at the same time.
Even though ZooKeeper ships with each Kafka release, that does not mean they should run on the same server. In fact, it is advised that in a production environment they run on separate servers.
In the Kafka broker configuration you can specify the ZooKeeper address, and it can be local or remote. This is from the broker config (config/server.properties):
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181
You can replace localhost with any other accessible server name or IP address.
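For instance, a broker pointed at a remote three-node ensemble might look roughly like this. The host names zk1 to zk3.example.com and the /kafka chroot are made up for illustration; the chroot is optional and, if used, is appended once at the end of the whole string rather than after each host.

```properties
# config/server.properties (same value on every broker)
# Hypothetical ZooKeeper hosts, followed by an optional chroot path.
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/kafka
```

Listing every ensemble member lets the broker reach ZooKeeper even when one of the listed hosts is down.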
We've been running the setup you described, with 3 to 5 nodes, each running a Kafka broker and the ZooKeeper that comes with the Kafka distribution. No issues with that setup so far, but our data throughput isn't high.
If we were to scale above 5 nodes we'd separate them, so that we only scale kafka brokers but keep the zookeeper ensemble small. If zookeeper and kafka start competing for I/O too much, then we'd move their data directories to separate drives. If they start competing for CPU, then we'd move them to separate boxes.
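If it came to that, a small dedicated ensemble could be described with something like the following sketch. The host names and the dataDir path are hypothetical, and the timing values are just the ones commonly seen in ZooKeeper's sample config, not tuned recommendations.

```properties
# config/zookeeper.properties (identical on each of the three ensemble hosts)
# Hypothetical data directory.
dataDir=/var/lib/zookeeper
clientPort=2181
# Length of one tick in milliseconds; initLimit and syncLimit are counted in ticks.
tickTime=2000
initLimit=10
syncLimit=5
# Ensemble members (hypothetical hosts): host:peer-port:leader-election-port
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each host also needs a file named myid inside dataDir containing only its own number (1, 2 or 3), matching its server.N line. The Kafka brokers can then grow independently of this list.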
All in all, it depends on your expected throughput and how easily you can upgrade your setup if it starts causing contention. You can start small and easy, with Kafka and ZooKeeper co-located, as long as you have the flexibility to upgrade your setup with more nodes and introduce separation later on. If you think this will be hard to add later, it is better to run them separately from the start. We've been running them co-located for 18+ months and haven't encountered resource contention so far.