Question:
I was going through the Apache posts and found a new term called Beam. Can anybody explain what exactly Apache Beam is? I tried to google it, but was unable to get a clear answer.
Answer 1:
Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
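To make that concrete, here is a minimal sketch of a word-count pipeline written with the Beam Java SDK, modeled on the project's MinimalWordCount example. The class name, input path, and output prefix are illustrative placeholders; the transforms used (TextIO, FlatMapElements, Filter, Count, MapElements) are standard Beam ones.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    // Pipeline options (runner, etc.) are parsed from command-line flags.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("input.txt"))        // illustrative input path
     .apply("SplitWords", FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
     .apply("DropEmptyWords", Filter.by((String word) -> !word.isEmpty()))
     .apply("CountWords", Count.perElement())
     .apply("FormatResults", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("wordcounts"));     // illustrative output prefix

    p.run().waitUntilFinish();
  }
}
```

The same pipeline definition can then be handed to any of the runtime-specific Runners mentioned above.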
History: The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model” and was first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform. Others in the community began writing extensions, including a Spark Runner, Flink Runner, and Scala SDK. In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing). Apache Beam graduated from incubation in December 2016.
Additional resources for learning the Beam Model:
- The Apache Beam website
- The VLDB 2015 paper (using the original naming Dataflow model)
- Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site
- A Beam podcast on Software Engineering Radio
Answer 2:
Apache Beam (Batch + strEAM) is a model and set of APIs for doing both batch and streaming data processing. It was open-sourced by Google (with Cloudera and PayPal) in 2016 via an Apache incubator project.
The page Dataflow/Beam & Spark: A Programming Model Comparison - Cloud Dataflow contrasts the Beam API with Apache Spark, which has been hugely successful at bringing a modern, flexible API and set of optimization techniques for both batch and streaming to the Hadoop world and beyond.
Beam tries to take all that a step further via a model that makes it easy to describe the various aspects of out-of-order processing, which is often an issue when combining batch and streaming, as described in that Programming Model Comparison.
In particular, to quote from that comparison (a short code sketch mapping these questions onto Beam's API follows the list), the Dataflow model is designed to address, elegantly and in a way that is more modular, robust, and easier to maintain:
... the four critical questions all data processing practitioners must attempt to answer when building their pipelines:
- What results are calculated? Sums, joins, histograms, machine learning models?
- Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated in fixed windows, sessions, or a single global window?
- When in processing time are results materialized? Does the time each event is observed within the system affect results? When are results emitted? Speculatively, as data evolve? When data arrive late and results must be revised? Some combination of these?
- How do refinements of results relate? If additional data arrive and results change, are they independent and distinct, do they build upon one another, etc.?
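As a rough illustration of how those four questions map onto the Beam Java API, here is a sketch in the style of the Beam/Dataflow gaming examples. The input collection `scores`, the class and method names, and the specific durations are assumptions made for this example; Window, FixedWindows, AfterWatermark, AfterProcessingTime, and Sum are Beam SDK classes.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class TeamScoreExample {
  // `scores` is assumed to be a PCollection of (team, points) pairs whose
  // elements already carry event-time timestamps.
  static PCollection<KV<String, Integer>> hourlyTeamTotals(
      PCollection<KV<String, Integer>> scores) {
    return scores
        // Where in event time: group elements into fixed one-hour event-time windows.
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            // When in processing time: emit a speculative result one minute after the
            // first element arrives, a result when the watermark passes the end of the
            // window, and revised results whenever late data arrives.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.standardDays(1))
            // How refinements relate: each new firing accumulates everything seen so far.
            .accumulatingFiredPanes())
        // What is computed: a per-team sum of points.
        .apply(Sum.integersPerKey());
  }
}
```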
The pipelines described in Beam can in turn be run on Spark, Flink, Google's Cloud Dataflow service, and other runners, including a "Direct" runner that executes on the local machine.
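As a sketch (the class name is hypothetical, and each runner also requires its artifact on the classpath), switching runners is typically a matter of pipeline options rather than code changes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // The same pipeline definition can target different runners. The runner is
    // usually chosen via a command-line flag, e.g.
    //   --runner=DirectRunner    (local execution; the default when none is given)
    //   --runner=FlinkRunner     (Apache Flink)
    //   --runner=SparkRunner     (Apache Spark)
    //   --runner=DataflowRunner  (Google Cloud Dataflow)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // ... apply the same transforms regardless of where the pipeline will run ...

    p.run().waitUntilFinish();
  }
}
```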
The architecture supports a variety of languages. The Java SDK is available now, a Dataflow Python SDK is nearing release, and others (Scala, etc.) are envisioned.
See the source at the GitHub Mirror of Apache Beam.