I'm measuring Kafka producer performance. So far I've come across two clients with slightly different configuration and usage:
Common:
import java.util.Properties

def buildKafkaConfig(hosts: String, port: Int): Properties = {
  // Broker list in "host:port" form, built from the two arguments
  val brokers = s"$hosts:$port"
  val props = new Properties()
  props.put("metadata.broker.list", brokers)
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")
  props.put("request.required.acks", "0")
  props.put("queue.buffering.max.ms", "5000")
  props.put("queue.buffering.max.messages", "2000")
  props.put("batch.num.messages", "300")
  props
}
First Client:
"org.apache.kafka" % "kafka_2.11" % "0.8.2.2"
Usage:
val kafkaConfig = KafkaUtils.buildKafkaConfig("kafkahost", 9092)
val producer = new Producer[String, String](new ProducerConfig(kafkaConfig))
// ... somewhere in code
producer.send(new KeyedMessage[String, String]("my-topic", data))
Second Client:
"org.apache.kafka" % "kafka-clients" % "0.8.2.2"
Usage:
val kafkaConfig = KafkaUtils.buildKafkaConfig("kafkahost", 9092)
val producer = new KafkaProducer[String, String](kafkaConfig)
// ... somewhere in code
producer.send(new ProducerRecord[String, String]("my-topic", data))
My questions are:
- What is the difference between the two clients?
- Which properties should I configure to achieve optimal performance for a write-heavy, high-scale application?
They are simply the old and new APIs. Starting with 0.8.2.x, Kafka exposes a new set of APIs: the older one is Producer, which works with KeyedMessage[K,V], while the new one is KafkaProducer, which works with ProducerRecord[K,V]. You should preferably use the new, supported API.
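One thing to watch when switching: the property names in the question's buildKafkaConfig belong to the old producer. The new KafkaProducer reads a different set of keys (bootstrap.servers, key.serializer, value.serializer, acks, linger.ms, batch.size, buffer.memory) and is always asynchronous, so there is no producer.type. Below is a minimal sketch of an equivalent configuration. The object and method names are hypothetical, and the numeric values are only illustrative, not tuned recommendations. Note that batch.size and buffer.memory are measured in bytes, not in messages, so they are not a one-to-one translation of batch.num.messages and queue.buffering.max.messages.

```scala
import java.util.Properties

// Hypothetical helper, not part of Kafka: builds a config for the new
// org.apache.kafka.clients.producer.KafkaProducer.
object NewProducerConfig {
  def buildNewProducerConfig(host: String, port: Int): Properties = {
    val props = new Properties()
    // Replaces metadata.broker.list
    props.put("bootstrap.servers", s"$host:$port")
    // Replace serializer.class; key and value serializers are set separately
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Replaces request.required.acks
    props.put("acks", "0")
    // Rough counterpart of queue.buffering.max.ms: how long to wait for a batch to fill
    props.put("linger.ms", "5000")
    // Per-partition batch size in BYTES (illustrative value)
    props.put("batch.size", "16384")
    // Total memory in BYTES the producer may use for buffering (illustrative value)
    props.put("buffer.memory", "33554432")
    props
  }
}
```

The resulting Properties can be passed straight to the constructor, as in new KafkaProducer[String, String](NewProducerConfig.buildNewProducerConfig("kafkahost", 9092)).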
This is a very broad question, and the answer depends a lot on the architecture of your software: the scale, the number of producers, the number of consumers, and so on. I would suggest going through the documentation and reading the sections on Kafka's architecture and design to get a better picture of how it works internally.
Generally speaking, in my experience you'll need to balance the replication factor of your data against retention times and the number of partitions each topic is split into. If you have more specific questions down the road, you should definitely post them as separate questions.