How to reduce disk space occupied by a partition?

Posted 2019-08-25 02:35

Question:

In my specific use case, we are going to ingest 1000GB of data every day. If I compress the files locally, they come to about 100GB.

I wrote a sample application to stream a 100MB file (which compresses down to about 10MB). Single producer, single topic with a single partition.

I have used transactions and enabled compression (gzip). I ran a command to find out the total size of the partition and it came to about 85MB, presumably because Kafka adds some data in order to guarantee exactly-once semantics. I create a huge batch of messages and commit them in a transaction. Each message is compressed.
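For reference, this is roughly how the producer is configured. It is a minimal sketch, not the exact code: the broker address, topic name, transactional.id, serializers, and payload are placeholders I chose, not values from the original setup.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class CompressedTransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // gzip compression is applied by the producer per record batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        // setting transactional.id enables transactional (exactly-once) writes
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "ingest-tx-1");

        byte[] payload = new byte[100 * 1024]; // placeholder chunk of the file being streamed

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            // every record carries a unique key, so log compaction cannot reclaim space
            producer.send(new ProducerRecord<>("ingest-topic", "unique-key-1", payload));
            producer.commitTransaction();
        }
    }
}
```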

I also looked at what Kafka has stored internally:

  • 0000.index
  • 0000.log (this consumed the most disk space)
  • 0000.timeindex
  • 0000.snapshot
  • leader-epoch-checkpoint
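The partition size mentioned above can also be read programmatically instead of by inspecting these files on disk. This is a small sketch using the AdminClient's describeLogDirs API, assuming kafka-clients 2.7+ (where allDescriptions() is available) and a single broker with id 0; these assumptions are not from the original post.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;
import org.apache.kafka.clients.admin.ReplicaInfo;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class PartitionSizeCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // ask broker 0 for its log directories and the replicas stored in each
            Map<Integer, Map<String, LogDirDescription>> dirs =
                    admin.describeLogDirs(Collections.singletonList(0)).allDescriptions().get();

            for (Map<String, LogDirDescription> byDir : dirs.values()) {
                for (Map.Entry<String, LogDirDescription> dir : byDir.entrySet()) {
                    for (Map.Entry<TopicPartition, ReplicaInfo> replica
                            : dir.getValue().replicaInfos().entrySet()) {
                        // size() is the total bytes of the segment files for that partition
                        System.out.printf("%s %s -> %d bytes%n",
                                dir.getKey(), replica.getKey(), replica.getValue().size());
                    }
                }
            }
        }
    }
}
```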

I have 2 questions:

  1. Why does the Kafka topic use so much disk space even after compression?

  2. What can I do to reduce the disk space occupied by my partition? FYI, log compaction will not be effective in my case, as every message is going to have a unique key.