I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers, and I'm using JMeter to load-test it by producing Kafka messages/events which are then processed. The processing job runs on a TaskManager and typically handles ~15K events/s.
The job has EXACTLY_ONCE checkpointing enabled and persists its state and checkpoints to Amazon S3.
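For reference, a minimal sketch of the kind of configuration described above (the checkpoint interval, class name, and S3 path are made up for illustration, not taken from the actual job):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds with exactly-once semantics.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Persist checkpoints/state to S3 (hypothetical bucket/path).
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));

        // ... Kafka source, processing, sink, then env.execute("job name")
    }
}
```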
If I shut down the TaskManager running the job, it takes a few seconds and then the job resumes on a different TaskManager. The job mainly logs the event IDs, which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down, the last count is, for example, 500000; the output of the resumed job on the other TaskManager then starts at around 400000. That means roughly 100K duplicated events. The exact number depends on the speed of the test and can be higher or lower.
I'm not sure if I'm missing something, but I would expect the job to continue with the next consecutive number (e.g. 500001) after resuming on the other TaskManager.
Does anyone know why this is happening, or what extra settings I have to configure to obtain exactly-once behavior?
You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.
Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.
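Because the restored state reflects everything up to the last completed checkpoint, the replayed events show up again in side effects such as logging, which is exactly the duplication you observed. If the logged IDs themselves need to be duplicate-free, one option is a keyed de-duplication step that uses the checkpointed state as a high-water mark. A rough sketch, assuming the consecutive integer IDs from the question and a keyed stream (the class name, state name, and types are hypothetical):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits each event ID at most once per key by tracking the highest ID seen so far.
public class DedupById extends RichFlatMapFunction<Long, Long> {
    private transient ValueState<Long> lastSeenId;

    @Override
    public void open(Configuration parameters) {
        lastSeenId = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastSeenId", Long.class));
    }

    @Override
    public void flatMap(Long eventId, Collector<Long> out) throws Exception {
        Long last = lastSeenId.value();
        // Replayed events after a restore fall at or below the restored
        // high-water mark and are skipped.
        if (last == null || eventId > last) {
            lastSeenId.update(eventId);
            out.collect(eventId);
        }
    }
}
```

Applied after a keyBy (e.g. `stream.keyBy(...).flatMap(new DedupById())`), the ValueState is included in each checkpoint, so after recovery only IDs above the restored high-water mark are emitted downstream. The operator state is consistent either way; this only affects what gets logged or sent to external systems, which otherwise need idempotent or transactional handling to avoid duplicates.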
For a more thorough explanation, see this data Artisans blog post: High-throughput, low-latency, and exactly-once stream processing with Apache Flink™, or the Flink docs.