We're trying to build a deduplication service using Kafka Streams.
The big picture is that it will use its RocksDB state store to check for existing keys during processing.
Please correct me if I'm wrong, but to make those state stores fault tolerant too, the Kafka Streams API will transparently copy the values in the state store into a Kafka topic (called the changelog).
That way, if our service dies, another instance will be able to rebuild its state store from the changelog found in Kafka.
But it raises a question in my mind: is this "state store --> changelog" step itself exactly-once?
I mean, when the service updates its state store, will it update the changelog in an exactly-once fashion too?
If the service crashes, another instance will take over the load, but can we be sure it won't miss a state store update from the crashed instance?
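For context, here is a minimal sketch of the kind of topology we have in mind (topic and store names are just placeholders, and DeduplicationTransformer stands for our hypothetical transformer that consults the store):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// RocksDB-backed store remembering the keys we have already processed
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("dedup-store"),
        Serdes.String(), Serdes.Long()));

// Records whose key is already in the store get dropped by the transformer
builder.<String, String>stream("input-topic")
       .transform(DeduplicationTransformer::new, "dedup-store")
       .to("output-topic");
```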
Regards,
Yannick
Short answer is yes.
Using transactions (atomic multi-partition writes), Kafka Streams ensures that when the offset commit is performed, the state store has also been flushed to the changelog topic on the brokers. These operations are atomic, so if one of them fails, the application will reprocess the messages from the previous offset position.
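To make that concrete, here is a minimal sketch of such a changelog-backed store (store name and serdes are illustrative). Changelog logging is enabled by default for persistent stores; it is spelled out below only for emphasis:

```java
import java.util.Collections;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// Persistent (RocksDB) store whose contents are mirrored to a changelog topic
StoreBuilder<KeyValueStore<String, Long>> dedupStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("dedup-store"),
                Serdes.String(),
                Serdes.Long())
        // On by default; the map can carry extra configs for the changelog topic
        .withLoggingEnabled(Collections.emptyMap());
```

If the instance dies, the next instance assigned to the task restores "dedup-store" by replaying that changelog topic before it resumes processing.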
You can read more about exactly-once semantics in the following blog post: https://www.confluent.io/blog/enabling-exactly-kafka-streams/. See in particular the section "How Kafka Streams Guarantees Exactly-Once Processing".
But it raises a question in my mind: is this "state store --> changelog" step itself exactly-once?
Yes -- as others have already said here. You must of course configure your application to use exactly-once semantics via the configuration parameter processing.guarantee; see https://kafka.apache.org/21/documentation/streams/developer-guide/config-streams.html#processing-guarantee (this link is for Apache Kafka 2.1).
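For example (application id and bootstrap servers are illustrative):

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Enable exactly-once processing; the default is "at_least_once"
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```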
We're trying to build a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store to check for existing keys during processing.
There's also an event de-duplication example application available at https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java. This link points to the repo branch for Confluent Platform 5.1.0, which uses Apache Kafka 2.1.0, i.e. the latest version of Kafka available right now.
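To give a rough idea of the core logic, here is a hedged sketch of a deduplication transformer (class name, store name "dedup-store", and String key/value types are illustrative, not taken from the linked example):

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Forwards a record only the first time its key is seen; duplicates are dropped.
public class DeduplicationTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, Long> seenKeys;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        // Must match the store name registered on the topology
        seenKeys = (KeyValueStore<String, Long>) context.getStateStore("dedup-store");
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        if (seenKeys.get(key) != null) {
            return null; // duplicate key: emit nothing
        }
        seenKeys.put(key, System.currentTimeMillis()); // remember the key
        return KeyValue.pair(key, value);              // first occurrence: forward it
    }

    @Override
    public void close() {
        // Nothing to do; the store is managed by Kafka Streams
    }
}
```

Note that the linked example uses a windowed store so that remembered keys eventually expire; a plain key-value store like the one above grows without bound.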