If the Kappa architecture does its analysis directly on the stream instead of splitting the data into two paths, where is the data stored then? In a messaging system like Kafka, or can it be kept in a database for recomputing?
And is a separate batch layer faster than recomputing with a stream processing engine for batch analytics?
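From what I understand so far, in a Kappa setup the log itself usually serves as the store of record: the Kafka topic is retained long enough (unlimited retention, or log compaction for keyed state) that a "batch" recomputation is simply a replay of the topic. Below is a minimal sketch of creating such a topic with Kafka's Java Admin client; the broker address, topic name, partition count and retention setting are my own illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateKappaTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // retention.ms = -1 tells Kafka to never delete records for this topic,
            // so the log itself can act as the system of record and be replayed
            // later for recomputation.
            NewTopic events = new NewTopic("events", 8, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1"));
            admin.createTopics(List.of(events)).all().get();
        }
    }
}
```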
"A very simple case to consider is when the algorithms applied to the
real-time data and to the historical data are identical. Then it is
clearly very beneficial to use the same code base to process
historical and real-time data, and therefore to implement the use-case
using the Kappa architecture". "Now, the algorithms used to process
historical data and real-time data are not always identical. In some
cases, the batch algorithm can be optimized thanks to the fact that it
has access to the complete historical dataset, and then outperform the
implementation of the real-time algorithm. Here, choosing between
Lambda and Kappa becomes a choice between favoring batch execution
performance over code base simplicity". "Finally, there are even more
complex use-cases, in which even the outputs of the real-time and
batch algorithm are different. For example, a machine learning
application where generation of the batch model requires so much time
and resources that the best result achievable in real-time is
computing and approximated updates of that model. In such cases, the
batch and real-time layers cannot be merged, and the Lambda
architecture must be used".
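The first case in the quote (identical algorithms) is where Kappa pays off: recomputation is just the same streaming code replayed over the retained log. Here is a minimal sketch of that replay pattern with the plain Kafka consumer API, reusing the assumed "events" topic from above. Starting the job with a fresh group id and auto.offset.reset=earliest makes it begin at the oldest retained record, so the very same processing code first recomputes the history and then keeps handling live records.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KappaRecompute {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-v2");             // fresh group id => no stored offsets
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");        // start at the oldest retained record
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));                             // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // One code path for everything: replayed historical records
                    // and new live records go through the same logic.
                    process(record.key(), record.value());
                }
            }
        }
    }

    static void process(String key, String value) {
        // Placeholder for the shared batch/stream algorithm.
        System.out.println(key + " -> " + value);
    }
}
```

In a real deployment this loop would more likely be a Kafka Streams or Flink job rather than a hand-rolled consumer, but the replay mechanism is the same.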
Lambda:
- Separate batch and stream layers
- Higher code complexity
- Faster batch performance thanks to the separate batch layer
- Better when batch and stream need different algorithms
- Cheaper when batch recomputation reads from bulk data storage instead of a database

Kappa:
- Only a stream processing layer
- Easier to maintain, lower complexity, a single algorithm and code base for batch and stream
- Recomputing very large datasets from a database would be expensive
- Recomputing very large datasets from a database or from Kafka would be slower than a dedicated batch layer