I am using Google Cloud Dataflow to implement an ETL data warehouse solution.
Looking into Google Cloud's offerings, it seems Dataproc can do the same thing.
It also seems Dataproc is a little cheaper than Dataflow.
Does anybody know the pros and cons of Dataflow versus Dataproc?
Why does Google offer both?
Here are three main points to consider when choosing between Dataproc and Dataflow:
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between the processing logic and the underlying execution engine. This helps with portability across the execution engines that support Beam, i.e. the same pipeline code can run on Dataflow, Spark, or Flink.
This flowchart from the Google website explains how to go about choosing one over the other.
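As a rough illustration, the three considerations above can be written as a small decision helper. This is only a sketch: the `Workload` fields and the `suggest_service` function are hypothetical names made up for this example, and the ordering of the checks (Hadoop dependencies first, Dataproc as the fallback when serverless operation is not wanted) is an assumption, not an official rule.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a job; field names are illustrative only."""
    depends_on_hadoop_ecosystem: bool  # processing relies on Hadoop-ecosystem tools
    wants_serverless: bool             # prefers automatic provisioning
    needs_runner_portability: bool     # same code should run on Dataflow, Spark, or Flink


def suggest_service(w: Workload) -> str:
    """Return "Dataproc" or "Dataflow" based only on the three points above."""
    # Assumption: a Hadoop-ecosystem dependency is treated as decisive.
    if w.depends_on_hadoop_ecosystem:
        return "Dataproc"
    # Serverless provisioning or Beam portability both point to Dataflow.
    if w.wants_serverless or w.needs_runner_portability:
        return "Dataflow"
    # Assumption: with no other signal, manual cluster control implies Dataproc.
    return "Dataproc"


if __name__ == "__main__":
    job = Workload(depends_on_hadoop_ecosystem=False,
                   wants_serverless=True,
                   needs_runner_portability=False)
    print(suggest_service(job))
```

In practice the choice usually also depends on cost and on the existing codebase, which this sketch does not model.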
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available at the link below:
https://cloud.google.com/dataproc/#fast--scalable-data-processing
For the same reason that Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework and just wants to deploy it on Google Cloud. So even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code in Beam to run on Dataflow.
The differences between the Spark and Beam programming models are quite large, and there are many use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions Articles.
Quick takeaways: