Kafka Streams application Endless rebalancing

We are running a kafka streams application and stuck with a strange problem. We are using both global state store and multiple other state stores.

Our application has loaded all the data and state stores has good amount of information in it now. Now, when we tried to bring down the application and bring it back again (some config changes), it is going into endless rebalancing .. To verify we reverted back config changes, but it it still stuck in that stage. There are no erros, etc

INFO  o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] Started Streams client
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from RUNNING to PARTITIONS_REVOKED
INFO  o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] State transition from RUNNING to REBALANCING
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition revocation took 1 ms.
    suspended active tasks: []
    suspended standby tasks: []
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from RUNNING to PARTITIONS_REVOKED
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition revocation took 0 ms.
    suspended active tasks: []
    suspended standby tasks: []
04:02:13.682 6985 [main] INFO  com..... - Started Application in 6.647 seconds (JVM running for 7.484)
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition assignment took 28 ms.
    current active tasks: [0_0, 1_0, 2_0, 3_0, 4_0, 5_0, 6_0, 7_5, 8_5, 9_5, 10_5, 12_4, 13_4, 14_4, 15_4, 16_4, 17_4, 19_3, 20_3, 21_3, 22_3, 23_3, 24_3, 25_3, 29_0]
    current standby tasks: [0_2]
    previous active tasks: []

04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition assignment took 28 ms.
    current active tasks: [0_3, 1_3, 2_3, 3_3, 4_3, 5_3, 7_2, 8_2, 9_2, 10_2, 12_1, 13_1, 14_1, 15_1, 16_1, 17_1, 19_0, 20_0, 21_0, 22_0, 23_0, 24_0, 25_0, 26_0]
    current standby tasks: [0_5]
    previous active tasks: []
04:03:47.602 100905 [http-nio-8080-exec-10] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:49.356 102659 [http-nio-8080-exec-2] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:51.600 104903 [http-nio-8080-exec-3] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:53.356 106659 [http-nio-8080-exec-4] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING

Number of topics - 100
Partitions per topic - 6.  (7 topics with 1 partition only)
kubernetes env - 3 pods ( 2 stream threads )

When we try to list consumer group using following command

root@bastion-0:/app/confluent-5.2.2/bin# ./kafka-consumer-groups --describe --group app  --bootstrap-server kafka-0..local:9094 --command-config /app/client-sasl-ssl.properties --members

CONSUMER-ID                                                                                               HOST                    CLIENT-ID                                                            #PARTITIONS     
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer-3b370697-e737-411c-af28-fb04cfbae1dd 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer 45              
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer-3edb3e5f-9f1a-499f-8732-6cd2c8b96c96 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer 45              
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer-00e24df4-5669-4e2c-a775-8f6c4f689714 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer 46              
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer-1b6b2955-5dfd-4be7-8ad9-9f1b54fe6310 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer 45              
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer-72cd0319-8ca7-493c-891d-3022b235ea01 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer 45              
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer-c1a16d64-8d49-4758-ab64-2af3cd9aef0f 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer 45

The output of the above command keeps on changing - from 0 to some variable number. Ideally it should become stable after some time.

Are there any tunables/configs for kafka streams balancing (rebalancing)

Questions:

What causes application to rebalance endlessly while starting (even though there are no errors/exception, etc).
Is there any tunables which can help us avoid rebalancing ?

Looking at the logs you have added, the consumer pod is starting up and so I guess maybe there is a rolling restart of the other 2 pods and hence a rebalance each time one stops and one starts.

Although Kafka is fast when running rebalance is not fast as there is chat across the group during the process - although partitions may be assigned to one consumer, the group only starts consuming when all consumers have had their assignment, and the discovery of assignment only happens within the poll method (see https://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html).

Hence the way to speed up the process is to poll more frequently so that you get to hear about changes quicker, but there is a trade off - if in normal running the topics are not busy then there will be a lot of spinning doing nothing.

However, you are not quite clear on what you mean by endlessly. If you mean that the application is literally only rebalancing then see my comment above. It may be that pods are going up and down continuously (heartbeats dying) or else polling is taking a long time - are you doing a lot of I/O for each record? Restarts would be obvious from the logs and the pod names. Excessive polling would also cause warning messages suggesting you either increase max.poll.interval.ms or reduce max.poll.records