Assume that I have a multi-broker Kafka setup (all brokers running on the same host) with 3 brokers and 50 topics, each of which is configured with 7 partitions and a replication factor of 3.
I have 50GB of disk space to spend on Kafka and I have to make sure the Kafka logs never exceed this amount, so I want to configure my retention policy to prevent this scenario.
I have set up a delete cleanup policy:
log.cleaner.enable=true
log.cleanup.policy=delete
and I need to configure the following properties so that the data is deleted on a weekly basis and I never run out of disk space:
log.retention.hours
log.retention.bytes
log.segment.bytes
log.retention.check.interval.ms
log.roll.hours
These topics contain data streamed from tables in a database with a total size of about 10GB (but inserts, updates and deletes are constantly streamed into these topics).
How should I go about configuring the aforementioned parameters so that the data is removed every 7 days, while also making sure that data can be deleted in a shorter window if needed so that I won't run out of disk space?
To accomplish what you've requested, I'd probably set log.retention.hours to 168, and log.retention.bytes to roughly 53687091200 divided by the number of partitions each broker will host (the limit applies to each partition, not to a whole topic). log.segment.bytes simply determines how many bytes go into a single log segment; the oldest closed segment is what gets deleted when cleanup runs. However, these are broker-level settings, so it's generally recommended to set retention.ms at the topic level instead of log.retention.hours, although the default value for retention.ms is exactly what you've asked for: 7 days.

Regarding the time retention, it's easy: just set it to what you need.
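For example, a minimal sketch of setting the time retention on one topic with the kafka-configs.sh tool could look like this (the topic name db-table-1 and the broker address are placeholders for your own values):

```
# Set time-based retention to 7 days (7 * 24 * 3600 * 1000 = 604800000 ms)
# for a single topic; repeat for each of the 50 topics.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name db-table-1 \
  --add-config retention.ms=604800000
```

Running the same command with --describe instead of --alter and --add-config lets you check what is currently set on the topic.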
For the size retention, this is not trivial for several reasons:
- The retention limits are minimum guarantees. This means that if you set log.retention.bytes to 1GB, you will always have at least 1GB of data available on disk; it does not cap the maximum size on disk the partition can take, only the lower bound.
- The log cleaner only runs periodically (every 5 minutes by default, controlled by log.retention.check.interval.ms), so in the worst-case scenario you could end up with 1GB plus the amount of data that can be written in 5 minutes. Depending on your environment, that can be a lot of data (see the sketch right after this list).
- Kafka writes a few more files (mostly indexes) to disk in addition to the partition's data. While these files are usually small (10MB by default), you may have to consider them.
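To put a number on that second point: at an assumed write rate of 1MB/s into a partition (purely an illustrative figure, not something from your setup), one 5-minute cleanup interval alone can add roughly:

```
# Data that can pile up in a partition between two cleanup runs,
# assuming an illustrative write rate of 1 MB/s:
echo $(( 1024 * 1024 * 300 ))   # 314572800 bytes, i.e. ~300 MB
```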
Ignoring the indexes, one decent heuristic you can use to estimate the max disk size of a partition is retention.bytes plus segment.bytes, plus whatever can be written during one cleanup interval.
In a normal environment it's rare that all partitions exceed their limits at the same time, so it's usually possible to ignore the second point.

If you want to count the indexes, then you also need to add segment.index.bytes twice (there are 2 indexes: offset and timestamp) for each segment.

With 3 brokers and 3 replicas, each broker will host 350 partitions. It's also probably safer to include a "fudge factor", as Kafka does not like full disks! So subtract 5-10% from your total disk size, especially if you don't count the indexes.
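To make that concrete, here's a rough back-of-the-envelope sketch for the numbers in your question; the 10% fudge factor and the 64MB segment.bytes are assumptions, not recommendations:

```
# 50GB budget per broker, minus a 10% fudge factor, spread across the
# 350 partition replicas each broker hosts:
echo $(( 53687091200 * 90 / 100 / 350 ))   # 138052520 bytes, ~138 MB per partition

# With an assumed segment.bytes of 64MB (67108864 bytes), that leaves about
# 138 MB - 64 MB ~= 70 MB for retention.bytes on each partition, before
# accounting for indexes or the data written between cleanup runs.
```

You would then set retention.bytes per topic (or log.retention.bytes on the brokers) to something in that ballpark.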
With all these gotchas in mind, you should be able to find the log size you need.