In my specific use case, we are going to ingest 1000GB of data every day. If I compress the files locally, they come to about 100GB.
I wrote a sample application to stream a 100MB file (which compresses down to 10MB). Single producer, single topic with a single partition.
I have enabled transactions and compression (gzip). I ran a command to find the total size of the partition, and it came to about 85MB. I assume Kafka might be adding some data in order to guarantee exactly-once semantics. I create a huge batch of messages and commit them in a transaction. Each message is compressed.
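For reference, this is roughly what my producer looks like. It is a minimal sketch, not my actual code: the broker address, topic name, transactional id, record count, and chunk size are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class CompressedTransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");              // producer-side gzip
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "ingest-tx-1");       // placeholder id
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");            // required for transactions

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            // one record per chunk of the input file; every message gets a unique key
            for (int i = 0; i < 1000; i++) {
                byte[] key = ("key-" + i).getBytes();
                byte[] value = new byte[100 * 1024];                             // stand-in for a 100KB chunk
                producer.send(new ProducerRecord<>("my-topic", key, value));     // placeholder topic
            }
            producer.commitTransaction();
        }
    }
}
```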
I also looked at what Kafka has stored internally:
- 0000.index
- 0000.log (this consumed the most disk space)
- 0000.timeindex
- 0000.snapshot
- leader-epoch-checkpoint
I have 2 questions:
1. Why does the Kafka topic use so much disk space even after compression?
2. What can I do to reduce the disk space used by my partition? FYI, log compaction will not be effective in my case, as every message has a unique key.