I want to describe the following case that occurred on one of our production clusters.
We have an Ambari cluster with HDP version 2.6.4.
The cluster includes 3 Kafka machines, and each Kafka broker has a 5 TB disk.
What we saw is that all Kafka disks were at 100% usage, so the Kafka disks were full, and this is the reason all Kafka brokers failed.
df -h /kafka
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 5T 5T 23M 100% /var/kafka
After investigation we saw that log.retention.hours was set to 7 days (i.e., 168 hours).
So it seems that purging happens only after 7 days, and maybe this is the reason the Kafka disks reach 100% usage even though they are huge (5 TB).
What we want now is to understand how to avoid this case in the future.
So we want to know:
How can we avoid fully used capacity on the Kafka disks?
What do we need to set in the Kafka configuration in order to purge Kafka data according to the disk size? Is that possible?
And how do we determine the right value for log.retention.hours? Should it be based on the disk size, or on something else?
In Kafka, there are two types of log retention: size and time retention. The former is triggered by log.retention.bytes, while the latter is triggered by log.retention.hours.
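For example, a broker with both limits configured could look like the two lines below (the values are only illustrative); for each partition, whichever limit is hit first causes the oldest segments to be deleted:
# size retention: delete old segments once a partition exceeds ~1 GB
log.retention.bytes=1073741824
# time retention: delete segments older than 168 hours (7 days)
log.retention.hours=168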
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take the following factors into consideration:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512 MB, you will always have 512 MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512 MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512 MB of data plus whatever is produced within the 5-minute window on disk before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments which have reached the retention limit, plus a 3rd one, the active segment that data is currently written to).
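To make this concrete, a rough worst-case estimate for a single partition with the values above (the 1 MB/s write rate is an assumed figure, not something measured on your cluster):
per-partition worst case ≈ log.retention.bytes + log.segment.bytes + (write rate × check interval)
≈ 1 GB + 512 MB + (1 MB/s × 300 s)
≈ 1.8 GB
Multiply this by the number of partitions (replicas included) hosted on a broker to estimate how much disk space Kafka may occupy there.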
Finally, you should do the math and compute the maximum size that Kafka logs might occupy on your disk at any given time, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If you don't need your data anymore after 2 days, then set log.retention.hours=48.
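Putting it all together, here is a sketch of what the relevant part of the broker configuration (server.properties) could look like for a 5 TB data disk. The 100 partitions per broker and the 10 GB per-partition cap are assumptions; replace them with your own numbers:
# keep data for at most 2 days
log.retention.hours=48
# cap each partition at ~10 GB (assumed); with ~100 partitions per broker (assumed)
# this bounds Kafka data at roughly 1 TB plus active segments, well below 5 TB
log.retention.bytes=10737418240
# 1 GB segments (the default), so active segments add at most ~100 GB
log.segment.bytes=1073741824
# check every 5 minutes (the default) for segments to delete
log.retention.check.interval.ms=300000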
I think you have three options:
1) Increase the size of the disks until you have a comfortable amount of free space with your current retention policy of 7 days. For me, a comfortable amount of free space is around 40% (but that is personal preference).
2) Lower your retention policy to, for example, 3 days and see whether your disks are still full after a period of time (see the example command after this list). The right retention period varies between use cases. If you don't need the data on Kafka as a backup when something goes wrong, then pick a very low retention period. If it is crucial that you keep those 7 days' worth of data, then you should not change the period but increase the disk sizes.
3) A combination of options 1 and 2.
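If you go with option 2, retention can also be lowered per topic, without touching the broker defaults, using the kafka-configs.sh tool that ships with Kafka. A sketch (the ZooKeeper address and topic name below are placeholders):
# set a 3-day time retention and a 50 GB size retention on one topic
kafka-configs.sh --zookeeper zk-host:2181 --alter --entity-type topics \
  --entity-name my-topic --add-config retention.ms=259200000,retention.bytes=53687091200
Topic-level retention.ms and retention.bytes override the broker-wide defaults for that topic.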
More information about optimal retention policies: Kafka optimal retention and deletion policy