I am writing a Google Dataflow pipeline, and one of the sources I need is the result set of a MySQL query. A couple of questions then:
- What would be the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done inline using JDBC?
- If I do indeed need to implement a "User-Defined Data Format" wrapping MySQL as a source, does anyone know if an implementation already exists, so that I don't need to reinvent the wheel? (Don't get me wrong, I would enjoy writing it, but I imagine using MySQL as a source is quite a common scenario.)
Thanks all!
A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.
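For reference, a minimal sketch of reading from MySQL via JdbcIO. The connection URL, credentials, table, and query are placeholder assumptions, and exact builder methods may vary slightly between Beam versions:

```java
import java.sql.ResultSet;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class MySqlJdbcIoExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read each row of the query result as a KV<id, name> pair.
    PCollection<KV<String, String>> rows = p.apply(JdbcIO.<KV<String, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
            .withUsername("username")
            .withPassword("password"))
        .withQuery("SELECT id, name FROM customers")
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
        .withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
          @Override
          public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
            return KV.of(resultSet.getString(1), resultSet.getString(2));
          }
        }));

    p.run();
  }
}
```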
At this time, Cloud Dataflow does not provide a MySQL input source.
The preferred way to add support for this is to write a user-defined input source that can handle MySQL queries.
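To give an idea of what that involves, here is a skeletal, single-bundle sketch against Beam's BoundedSource API. This is not a production implementation: method names differ slightly across SDK versions, and the splitting, size estimation, and connection handling below are placeholder assumptions.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

public class MySqlSource extends BoundedSource<String> {
  private final String url;    // e.g. "jdbc:mysql://hostname:3306/mydb?user=...&password=..."
  private final String query;  // e.g. "SELECT name FROM customers"

  public MySqlSource(String url, String query) {
    this.url = url;
    this.query = query;
  }

  @Override
  public List<? extends BoundedSource<String>> split(
      long desiredBundleSizeBytes, PipelineOptions options) {
    // A real source would split the query into key ranges for parallel
    // reads; this sketch reads everything in a single bundle.
    return Collections.singletonList(this);
  }

  @Override
  public long getEstimatedSizeBytes(PipelineOptions options) {
    return 0; // Unknown; a real source could consult table statistics.
  }

  @Override
  public Coder<String> getOutputCoder() {
    return StringUtf8Coder.of();
  }

  @Override
  public BoundedReader<String> createReader(PipelineOptions options) {
    return new MySqlReader(this);
  }

  static class MySqlReader extends BoundedReader<String> {
    private final MySqlSource source;
    private Connection connection;
    private ResultSet resultSet;

    MySqlReader(MySqlSource source) {
      this.source = source;
    }

    @Override
    public boolean start() throws IOException {
      try {
        connection = DriverManager.getConnection(source.url);
        resultSet = connection.createStatement().executeQuery(source.query);
        return resultSet.next();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    public boolean advance() throws IOException {
      try {
        return resultSet.next();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    public String getCurrent() throws NoSuchElementException {
      try {
        return resultSet.getString(1); // First column of the current row.
      } catch (SQLException e) {
        throw new NoSuchElementException(e.getMessage());
      }
    }

    @Override
    public void close() throws IOException {
      try {
        connection.close();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    public BoundedSource<String> getCurrentSource() {
      return source;
    }
  }
}
```

The pipeline would then consume it with `p.apply(Read.from(new MySqlSource(url, query)))` via `org.apache.beam.sdk.io.Read`.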
An alternative would be to execute the query in the main program, stage the results to a temporary location in GCS, process them with Dataflow, and then remove the temporary files.
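A sketch of that staging approach, assuming a hypothetical bucket `my-bucket` and placeholder connection details. The GCS client and Beam calls are illustrative, and a real job should also handle failures between the staging and cleanup steps:

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StagedQueryExample {
  public static void main(String[] args) throws Exception {
    // 1. Execute the query in the main program and build a CSV payload.
    StringBuilder csv = new StringBuilder();
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://hostname:3306/mydb", "username", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers")) {
      while (rs.next()) {
        csv.append(rs.getString(1)).append(',').append(rs.getString(2)).append('\n');
      }
    }

    // 2. Stage the results to a temporary location in GCS.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    storage.create(
        BlobInfo.newBuilder("my-bucket", "tmp/query-results.csv").build(),
        csv.toString().getBytes(StandardCharsets.UTF_8));

    // 3. Process the staged file with Dataflow.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("gs://my-bucket/tmp/query-results.csv"));
    // ... further transforms ...
    p.run().waitUntilFinish();

    // 4. Remove the temporary file once the job has finished.
    storage.delete("my-bucket", "tmp/query-results.csv");
  }
}
```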
Hope this helps
Could you please clarify the need for GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows keyed on the primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?
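For concreteness, a hypothetical reconstruction of the pipeline shape being asked about; ReadQueryResults and Regroup are stubbed here, since the referenced example isn't shown:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GroupByKeyShape {
  // Stub standing in for ReadQueryResults: emits each row keyed by its primary key.
  static class ReadQueryResults extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(KV.of(c.element(), "row-payload"));
    }
  }

  // Stub standing in for Regroup: processes one group of rows per key.
  static class Regroup extends DoFn<KV<String, Iterable<String>>, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element().getKey());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<KV<String, String>> rows =
        p.apply(Create.of("pk1", "pk2"))
         .apply(ParDo.of(new ReadQueryResults()));

    // Since every primary key is unique, this produces one
    // single-element group per row of the result set.
    PCollection<KV<String, Iterable<String>>> grouped =
        rows.apply(GroupByKey.<String, String>create());

    grouped.apply(ParDo.of(new Regroup()));
    p.run();
  }
}
```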