Google Cloud Dataflow User-Defined MySQL Source

Published 2019-02-10 23:14

Question:

I am writing a Google Dataflow pipeline, and one of its sources needs to be a MySQL result set obtained via a query. A couple of questions:

  1. What is the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done in-line using JDBC?
  2. If I do need to implement a "User-Defined Data Format" wrapping MySQL as a source, does anyone know whether an implementation already exists, so that I do not need to reinvent the wheel? (Don't get me wrong, I would enjoy writing it, but I imagine that using MySQL as a source is quite a common scenario.)

Thanks all!

Answer 1:

A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.



Answer 2:

At this time, Cloud Dataflow does not provide a MySQL input source.

The preferred way to implement support for this is to implement a user-defined input source that can handle MySQL queries.

An alternative is to execute the query in the main program, stage the results of the query in a temporary location in GCS, process the results using Dataflow, and then remove the temporary files.
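
The first half of that alternative (run the query in the main program and write the rows to a staging file) could look roughly like the sketch below. It uses only plain JDBC from the Java standard library. The class and method names (`QueryStager`, `toCsvLine`, `stageQuery`) are made up for illustration, a local temporary file stands in for the GCS staging location, and the upload to GCS, the pipeline itself, and the cleanup are not shown.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Dataflow or Beam.
final class QueryStager {

    /** Joins fields into one CSV line, quoting fields that contain a comma or a double quote. */
    public static String toCsvLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            String field = fields.get(i);
            if (field == null) {
                continue; // SQL NULL is written as an empty field
            }
            if (field.contains(",") || field.contains("\"")) {
                // Double any embedded quotes and wrap the field in quotes.
                line.append('"').append(field.replace("\"", "\"\"")).append('"');
            } else {
                line.append(field);
            }
        }
        return line.toString();
    }

    /** Runs the query and writes each row as one CSV line to a new temporary file. */
    public static Path stageQuery(Connection conn, String query)
            throws SQLException, IOException {
        // Assumption: a local temp file stands in for the GCS staging location.
        Path staged = Files.createTempFile("mysql-staged-", ".csv");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query);
             BufferedWriter out = Files.newBufferedWriter(staged, StandardCharsets.UTF_8)) {
            int columns = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<String> fields = new ArrayList<>(columns);
                for (int c = 1; c <= columns; c++) { // JDBC columns are 1-based
                    fields.add(rs.getString(c));
                }
                out.write(toCsvLine(fields));
                out.newLine();
            }
        }
        return staged;
    }
}
```

To use it, the main program would open a connection with `DriverManager.getConnection(...)` (with a MySQL JDBC driver on the classpath), call `stageQuery`, copy the resulting file to a GCS bucket, point the pipeline's text input at it, and delete the staged files once the job finishes. Note that this sketch does not handle field values containing line breaks, which a line-based reader would split across records.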

Hope this helps



Answer 3:

Could you please clarify the need for GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows keyed on the primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?