Google Cloud Dataflow User-Defined MySQL Source

Published 2019-02-10 23:00

I am writing a Google Dataflow Pipeline and as one of the Sources I require a MySQL resultset via a query. A couple of questions then:

  1. What would be the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done inline using JDBC?
  2. If I do indeed need to implement a "User-Defined Data Format" wrapping MySQL as a source, does anyone know whether an implementation already exists, so that I don't need to reinvent the wheel? (Don't get me wrong, I would enjoy writing it, but I imagine using MySQL as a source is a common enough scenario that someone has likely done it already.)

Thanks all!

3 Answers
Juvenile、少年°
Reply #2 · 2019-02-10 23:32

Could you please clarify the need for GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows keyed on primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?

等我变得足够好
Reply #3 · 2019-02-10 23:42

At this time, Cloud Dataflow does not provide a MySQL input source.

The preferred way to implement support for this is to implement a user-defined input source that can handle MySQL queries.

An alternative would be to execute the query in the main program, stage the results to a temporary location in GCS, process the results using Dataflow, and then delete the temporary files.
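A minimal sketch of that staging step, using only the standard JDBC API. The connection URL, table, and column names are placeholders, and a local temp file stands in for the GCS staging location (in practice you would upload the file to GCS, e.g. with gsutil or the GCS client library, and point TextIO at it):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StageQueryResults {

    // Quote a field for CSV output so commas, quotes, or newlines in values
    // do not break the line-oriented file Dataflow will read back.
    static String escapeCsv(String field) {
        if (field == null) {
            return "";
        }
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join fields into one CSV line, escaping each field first.
    static String toCsvLine(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String f : fields) {
            escaped.add(escapeCsv(f));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) throws IOException, SQLException {
        // Placeholder connection details for illustration only.
        String url = "jdbc:mysql://localhost:3306/mydb?user=app&password=secret";
        Path staged = Files.createTempFile("query-results", ".csv");

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers")) {
            ResultSetMetaData meta = rs.getMetaData();
            List<String> lines = new ArrayList<>();
            while (rs.next()) {
                List<String> fields = new ArrayList<>();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    fields.add(rs.getString(i));
                }
                lines.add(toCsvLine(fields));
            }
            Files.write(staged, lines);
        }
        // The pipeline would then read the staged file (via TextIO) and the
        // main program would delete it once the job completes.
    }
}
```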

Hope this helps

在下西门庆
Reply #4 · 2019-02-10 23:54

A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.
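A sketch of what reading MySQL through JdbcIO looks like, based on the `JdbcIO.read()` / `DataSourceConfiguration` / `RowMapper` pattern from the Beam SDK; the driver class, connection URL, credentials, and query below are placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;

public class JdbcIoExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder MySQL connection details; swap in your own instance.
    p.apply(JdbcIO.<KV<Integer, String>>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
                .withUsername("username")
                .withPassword("password"))
        .withQuery("select id, name from customers")
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
        // Map each JDBC row to a pipeline element.
        .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
          public KV<Integer, String> mapRow(java.sql.ResultSet resultSet) throws Exception {
            return KV.of(resultSet.getInt(1), resultSet.getString(2));
          }
        }));

    p.run();
  }
}
```

This compiles against the Beam Java SDK plus the `beam-sdks-java-io-jdbc` module and the MySQL JDBC driver on the classpath; it is a configuration sketch rather than a runnable standalone program, since it needs a live pipeline runner and database.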
