I am currently using the Confluent HDFS Sink Connector (v4.0.0) to replace Camus. We are dealing with sensitive data, so we need to keep offsets consistent during the cutover to connectors.
Cutover plan:
- We created an HDFS sink connector subscribed to a topic, writing to a temporary HDFS file. This creates a consumer group with name connect-
- Stopped the connector using a DELETE request.
- Using the /usr/bin/kafka-consumer-groups script, I am able to set the current offset of the connector consumer group's topic partition to a desired value (i.e. last offset Camus wrote + 1).
- When I restart the HDFS sink connector, it continues reading from the last committed connector offset and ignores the value I set. I am expecting the HDFS file name to be like: hdfs_kafka_topic_name+kafkapartition+Camus_offset+Camus_offset_plus_flush_size.format
Is my expectation of the Confluent connector's behavior correct?
When you restart this connector, it will use the offset embedded in the file name of the last file written to HDFS. It will not use the consumer group offset. It does this because it uses a write-ahead log to achieve exactly-once delivery to HDFS.
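To illustrate the idea, here is a rough Python sketch of how a resume offset can be derived from file names alone. It is not the connector's actual code. It assumes committed files are named `<topic>+<partition>+<startOffset>+<endOffset>.<format>`, matching the pattern in the question, and the function name, regex, and sample file names are hypothetical.

```python
import re

# Assumed file name pattern: <topic>+<partition>+<startOffset>+<endOffset>.<format>
FILENAME_RE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.(?P<ext>\w+)$"
)


def next_offset_from_filenames(filenames, topic, partition):
    """Return (highest end offset + 1) for the topic partition, or None if no file matches."""
    max_end = None
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match is None:
            continue  # skip anything that does not follow the assumed pattern
        if match.group("topic") != topic or int(match.group("partition")) != partition:
            continue  # file belongs to a different topic partition
        end = int(match.group("end"))
        if max_end is None or end > max_end:
            max_end = end
    return None if max_end is None else max_end + 1


# Hypothetical directory listing for illustration only.
files = [
    "my_topic+0+0000000000+0000000099.avro",
    "my_topic+0+0000000100+0000000199.avro",
    "my_topic+1+0000000000+0000000049.avro",
]

print(next_offset_from_filenames(files, "my_topic", 0))  # 200
```

Because the starting point comes from the files already in HDFS, changing the consumer group offset with kafka-consumer-groups has no effect on where the sink resumes.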