Lambda Architecture - Why batch layer

Posted 2019-04-12 12:55

I am going through the Lambda Architecture and trying to understand how it can be used to build fault-tolerant big data systems.

I am wondering how the batch layer is useful when everything can be stored in the real-time view and the results generated from it. Is it because real-time storage can't hold all of the data, and if it did it would no longer be real-time, since the time taken to retrieve the data depends on how much space the data takes up?

3 Answers
爱情/是我丢掉的垃圾
Answered 2019-04-12 13:27

Further to the answer provided by @karthik manchala, data processing can be handled in three ways: batch, interactive, and real-time / streaming.

I believe your reference to real-time is more about interactive response than streaming, as not all use cases are streaming-related.

Interactive responses are those where the response can be expected anywhere from sub-second to a few seconds to minutes, depending on the use case. The key here is to understand that the processing is done on data at rest, i.e. data already stored on a storage medium. The user interacts with the system while it processes and waits for the response. All the efforts behind Hive on Tez, Impala, Spark core, etc. aim to address this and make the responses as fast as possible.

Streaming, on the other hand, is where data streams into the system in real time (for example Twitter feeds or click streams) and processing needs to happen as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.

The case for batch processing is to address scenarios where heavy lifting needs to be done on a huge dataset beforehand, so that the user is led to believe that the responses they see are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing may run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index gets a response with sub-second latency. As you can see, the indexing cannot be done in real time because there may be huge amounts of data. The same is true of Google search, where the indexing is done in batch mode and the results are presented in interactive mode.
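To make the indexing example concrete, here is a minimal Python sketch (illustrative data and function names, not Solr's actual API): a batch job pre-computes an inverted index over the whole document collection, after which each interactive query is answered with a single dictionary lookup.

```python
from collections import defaultdict

def build_index(documents):
    """Batch job: pre-compute an inverted index (slow, scans the full dataset)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, term):
    """Interactive query: a fast lookup against the pre-computed view."""
    return sorted(index.get(term.lower(), set()))

docs = {
    1: "lambda architecture batch layer",
    2: "streaming data in real time",
    3: "batch processing of big data",
}
index = build_index(docs)        # heavy lifting done ahead of time
print(search(index, "batch"))    # -> [1, 3]
```

The expensive work happens once, up front; the per-query cost no longer depends on the size of the collection.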

All three modes of data processing are likely involved in any organisation grappling with data challenges. The Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data-processing requirements.

手持菜刀,她持情操
Answered 2019-04-12 13:29

Why batch layer

To save time and money!

It basically has two functions:

  • To manage the master dataset (assumed to be immutable)
  • To pre-compute the batch views for ad-hoc querying

"Everything can be stored in the realtime view and the results generated out of it" - NOT TRUE

The above is certainly possible, but not feasible, as the data could run to hundreds or thousands of petabytes, and generating results could take time... a lot of time!

The key here is to attain low-latency queries over a large dataset. The batch layer is used to create batch views (so queries are served with low latency), and the real-time layer is used for recent/updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the real-time views, instead of computing over the entire master dataset.
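A rough sketch of that merge step (plain Python with illustrative names, not a real framework API): suppose the batch view holds pre-computed page-view counts up to the last batch run, and the real-time view holds counts only for events that arrived since.

```python
# Batch view: pre-computed over the (large) master dataset up to the last batch run.
batch_view = {"home": 10_000, "about": 2_500}

# Real-time view: only the small slice of data that arrived since that run.
realtime_view = {"home": 42, "pricing": 7}

def query(page):
    """Answer an ad-hoc query by merging both views,
    instead of re-scanning the entire master dataset."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("home"))     # -> 10042
print(query("pricing"))  # -> 7
```

The query only ever touches two small pre-computed structures, which is what keeps it low-latency regardless of how large the master dataset grows.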

Also, think of the same query running again and again over a huge dataset: a loss of time and money!

萌系小妹纸
Answered 2019-04-12 13:30

You can check out the Kappa Architecture, where there is no separate batch layer: everything is analyzed in the stream layer. You can use Kafka, in the right configuration, as your master dataset storage, and save the computed data in a database as your view.

If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace the old view. It is possible to use only the real-time view as the main storage for ad-hoc queries, but as already mentioned in other answers, when you have a lot of data it is faster to keep batch processing and stream processing separate instead of running batch jobs as stream jobs. It depends on the size of your data. It is also cheaper to use storage like HDFS instead of a database for batch computation.
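A minimal sketch of that recomputation idea (plain Python standing in for Kafka and the view database; all names are illustrative): the master dataset is an append-only log of events, and the view can be rebuilt at any time by replaying the log from the beginning.

```python
# Append-only event log standing in for a Kafka topic (the master dataset).
log = [
    ("user_signup", "alice"),
    ("user_signup", "bob"),
    ("user_delete", "alice"),
    ("user_signup", "carol"),
]

def recompute_view(events):
    """Rebuild the view from scratch by replaying every event in order.
    In a Kappa setup this would be a new stream job reading the topic
    from offset 0; the finished view then replaces the old one."""
    view = set()
    for kind, user in events:
        if kind == "user_signup":
            view.add(user)
        elif kind == "user_delete":
            view.discard(user)
    return view

active_users = recompute_view(log)
print(sorted(active_users))  # -> ['bob', 'carol']
```

Because the log is immutable and ordered, replaying it always yields the same view, which is what makes the "recompute and swap" step safe.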

And a last point: in many cases you have different algorithms for batch and stream processing, so you need to keep them separate. But it is basically possible to use only the "realtime view" as both your batch and stream layer, without using Kafka as the master dataset. It depends on your use case.
