Considering an Apache Flink streaming-application with a pipeline like this:
Kafka-Source -> flatMap 1 -> flatMap 2 -> flatMap 3 -> Kafka-Sink
where every flatMap
function is a non-stateful operator (e.g. the normal .flatMap
function of a Datastream
).
How do checkpoints/savepoints work, in case an incoming message will be pending at flatMap 3
? Will the message be reprocessed after restart beginning from flatMap 1
or will it skip to flatMap 3
?
I am a bit confused, because the documentation seems to refer to application state as what I can use in stateful operators, but I don't have stateful operators in my application. Is the "processing progress" saved & restored at all, or will the whole pipeline be re-processed after a failure/restart?
And this there a difference between a failure (-> flink restores from checkpoint) and manual restart using savepoints regarding my previous questions?
I tried finding out myself (with enabled checkpointing using EXACTLY_ONCE
and rocksdb-backend) by placing a Thread.sleep()
in flatMap 3
and then cancelling the job with a savepoint. However this lead to the flink
commandline tool hanging until the sleep
was over, and even then flatMap 3
was executed and even sent out to the sink before the job got cancelled. So it seems I can not manually force this situation to analyze flink's behaviour.
In case "processing progress" is not saved/covered by the checkpointing/savepoints as I described above, how could I make sure for every message reaching my pipeline that any given operator (flatmap 1/2/3) is never re-processed in a restart/failure situation?