Offset of partition 0 is very close to the sum of the others

Posted 2019-08-20 02:23

Question:

I have a topic composed of 5 partitions, as follows:

p[0] offset: 492453047
p[1] offset: 122642552
p[2] offset: 122641146
p[3] offset: 122636144
p[4] offset: 122638175

It seems the offset of partition 0 is very close to the sum of the offsets of the remaining partitions. I can't figure out how or why.
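A quick check of the numbers, as a minimal Python sketch. The variable names are made up for illustration, and each offset is treated as a rough count of records written to that partition:

```python
# Offsets copied from the list above; names are illustrative only.
offsets = [492453047, 122642552, 122641146, 122636144, 122638175]

rest_sum = sum(offsets[1:])            # partitions 1..4 combined
difference = offsets[0] - rest_sum     # gap between partition 0 and that sum
share_p0 = offsets[0] / sum(offsets)   # fraction of all records in partition 0

print(rest_sum)            # 490558017
print(difference)          # 1895030
print(f"{share_p0:.1%}")   # 50.1%
```

So partition 0 holds about half of everything, and the other four hold about an eighth each.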

Answer 1:

With Kafka, the producer is responsible for assigning a partition to each record.

This is configurable using the partitioner.class setting. If you haven't changed it, the default partitioner works as follows:

  • If a partition is specified in the record, use it.
  • If no partition is specified but a key is present, choose a partition based on a hash of the key.
  • If neither a partition nor a key is present, choose a partition in a round-robin fashion.
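The three rules above can be sketched in a few lines of Python. This is only an illustration of the decision order, not Kafka's code: `choose_partition` is a hypothetical name, and `zlib.crc32` stands in for the real hash (the Java client hashes the serialized key bytes with murmur2) just to keep the example dependency-free.

```python
import itertools
import zlib
from typing import Optional

NUM_PARTITIONS = 5  # matches the topic in the question

# Shared counter for the round-robin branch.
_round_robin = itertools.count()


def choose_partition(key: Optional[bytes] = None,
                     partition: Optional[int] = None,
                     num_partitions: int = NUM_PARTITIONS) -> int:
    """Illustrative stand-in for the three rules (not Kafka's implementation)."""
    if partition is not None:
        # Rule 1: an explicit partition always wins.
        return partition
    if key is not None:
        # Rule 2: hash the key. crc32 is an assumption made for this sketch.
        return zlib.crc32(key) % num_partitions
    # Rule 3: no partition and no key, so rotate through the partitions.
    return next(_round_robin) % num_partitions


print(choose_partition(partition=3))            # 3
print(choose_partition(key=b"user-42") ==
      choose_partition(key=b"user-42"))         # True
print([choose_partition() for _ in range(5)])   # [0, 1, 2, 3, 4]
```

The point to notice is rule 2: the same key always maps to the same partition, no matter how many records carry it.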

So it looks like your keys are not evenly distributed. Either you have only a few distinct keys, or one specific key has significantly more records than the others. Keys are usually used to ensure that records with the same key are sent to the same partition (and thus stay ordered).
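To see how a single heavy key could produce this kind of shape, here is a small simulation. Everything in it is an assumption for illustration: the key names, the workload (10,000 distinct keys seen once each, plus one "hot" key with 6,000 extra records), and `zlib.crc32` as a stand-in for the real hash.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 5


def partition_for(key: str) -> int:
    # crc32 is a stand-in hash for this sketch, not the hash Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def count_per_partition(keys):
    # Number of records that land in each partition, in partition order.
    counts = Counter(partition_for(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# Hypothetical workload: many keys seen once, plus one dominant key.
uniform_keys = [f"user-{i}" for i in range(10_000)]
hot_keys = ["hot-user"] * 6_000

even = count_per_partition(uniform_keys)
skewed = count_per_partition(uniform_keys + hot_keys)
hot_partition = partition_for("hot-user")

print("without hot key:", even)
print("with hot key:   ", skewed)
print("hot key lands in partition", hot_partition)
```

All 6,000 hot-key records go to one partition, so that partition's count grows by exactly 6,000 while the others are unchanged. If the remaining keys spread roughly evenly, one partition ends up with about half of the records, which is the pattern in the question.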

A bit of skew towards one partition is not necessarily bad; it depends mostly on your use case. If you think the data could be partitioned better, you can implement your own partitioner.
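As a rough idea of what a custom partitioner might do, the sketch below spreads records for a known hot key across all partitions and hashes everything else as usual. It shows only the logic, in Python; with the Java client you would put similar logic in a class implementing `org.apache.kafka.clients.producer.Partitioner` and name it in partitioner.class. `HOT_KEYS`, `custom_partition`, and the use of `zlib.crc32` are assumptions made for the example.

```python
import itertools
import zlib
from collections import Counter

NUM_PARTITIONS = 5
HOT_KEYS = {b"hot-user"}  # hypothetical: keys known in advance to dominate traffic

_spread = itertools.count()


def custom_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    if key in HOT_KEYS:
        # Spread hot keys over every partition.
        # Trade-off: records for these keys are no longer ordered per key.
        return next(_spread) % num_partitions
    # Everything else keeps the usual "same key, same partition" behaviour.
    return zlib.crc32(key) % num_partitions


records = [b"hot-user"] * 6_000 + [f"user-{i}".encode() for i in range(10_000)]
counts = Counter(custom_partition(k) for k in records)
print([counts[p] for p in range(NUM_PARTITIONS)])
```

Whether this is acceptable depends on whether consumers rely on ordering for the hot key. If they do, the skew may be the price of that guarantee.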



Answer 2:

The Producer

The producer sends data directly to the broker that is the leader for the partition, without any intervening routing tier. To help the producer do this, all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time, so that the producer can direct its requests appropriately.

The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or it can be done by some semantic partitioning function. We expose the interface for semantic partitioning by allowing the user to specify a key to partition by and using this to hash to a partition (there is also an option to override the partition function if need be). For example, if the key chosen was a user id, then all data for a given user would be sent to the same partition. This in turn allows consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers.
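The user id example can be illustrated with a toy model in Python. The partitions are plain lists, the events are invented, and `zlib.crc32` stands in for whatever hash the client applies, so treat it as a sketch of the locality property rather than of Kafka itself.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 5


def partition_for(user_id: str) -> int:
    # Stand-in hash for this sketch; a real client applies its own function.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical (user_id, event) records in the order they are produced.
events = [("alice", "login"), ("bob", "login"), ("alice", "view"),
          ("carol", "login"), ("bob", "logout"), ("alice", "logout")]

# Simulated partitions: appending preserves order within each partition.
partitions = defaultdict(list)
for user_id, event in events:
    partitions[partition_for(user_id)].append((user_id, event))

# For each user, collect the set of partitions that hold their events.
homes = defaultdict(set)
for p, records in partitions.items():
    for user_id, _ in records:
        homes[user_id].add(p)

print({user: sorted(ps) for user, ps in homes.items()})
```

Each user maps to exactly one partition, so a consumer reading that partition sees all of that user's events, in order, and can keep per-user state locally.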