I am in charge of operating two kafka clusters (one for prod and one for our dev environment). The setup is mostly similiar, but the dev environment has no SASL/SSL setup and uses just 4 instead of 8 brokers. Each broker is assigned to a dedicated google kubernetes node with 4 vCPU and 26GB RAM.
On our dev environment we've got roughly 1000 messages in / sec and each of the 4 brokers uses pretty consistently 3 out of the 4 available CPU cores (75% CPU usage).
On our prod environment we got about 1500 messages in / sec and the CPU usage is also 3 out of 4 cores there.
It seems that CPU usage is at least the bottleneck for us and I'd like to know how I can perform a CPU profiling, so that I know what exactly is causing the high cpu usage. Since it's relatively consistent I guess it could be our snappy compression.
I am interested in all ideas how I could investigate the cause of the high cpu usage and how I could tweak that in my cluster.
Apache Kafka version: 2.1 (CPU load used to be similiar on Kafka 0.11.x too)
Dev Cluster (Snappy compression, no SASL/SSL, 4 Brokers): 1000 messages in / sec, 3 CPU cores consistent usage
Prod cluster (Snappy compression, SASL/SSL, 8 Brokers): 1500 messages in / sec, 3 CPU cores consistent usage
Side note: I already made sure producers produce their messages snappy compressed. I have access to all JMX metrics, couldn't find anything useful for figuring out the CPU usage though.
I already have metrics attached to my prometheus (this is where I got the CPU usage stats from too). The problem is that the container's CPU usage doesn't tell me WHY it is that high. I need more granularity e. g. what are CPU cycles being spent on (compression? broker communication? sasl/ssl?).
If you have access to JMX metrics you are almost done for profiling CPU. All thing have to do is installing Prometheus and Grafana and then store metrics in Prometheus and monitor them with Grafana. You can find complete steps in Monitoring Kafka
Note: If you are suspicious about snappy compression, maybe this performance test can help you
Update:
Based on Confluent, most of the CPU usage is because of SSL.