Analyze Cloud Dataflow BigQuery Throughput / Pipeline

Published 2019-04-11 19:31

Question:

I'm trying to figure out how Dataflow scales certain operations and how to make them perform best. To start with, I created a simple flow that reads data from BigQuery (~25M rows, 30 GB total), does a JSON extract, a simple group by key, then an aggregation over each group (~100 elements per key), applies one more transform per key, and writes the result back into a new table (~500k rows, 25 GB total).
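For context, the shape of the pipeline is roughly the Beam (Java) sketch below; the table names, field names, and the sum aggregation are simplified placeholders rather than my real job:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

public class ThroughputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadSource", BigQueryIO.readTableRows().from("my-project:my_dataset.source_table"))
     // JSON extract: pull a key and a numeric value out of each row (placeholder field names).
     .apply("ExtractKeyValue", MapElements.via(new SimpleFunction<TableRow, KV<String, Double>>() {
       @Override
       public KV<String, Double> apply(TableRow row) {
         String key = (String) row.get("key_field");
         double value = Double.parseDouble(String.valueOf(row.get("value_field")));
         return KV.of(key, value);
       }
     }))
     // Group by key and aggregate (~100 elements per key); Sum stands in for the real aggregation.
     .apply("AggregatePerKey", Sum.<String>doublesPerKey())
     // Per-key transform back into output rows.
     .apply("ToTableRow", MapElements.via(new SimpleFunction<KV<String, Double>, TableRow>() {
       @Override
       public TableRow apply(KV<String, Double> kv) {
         return new TableRow().set("key", kv.getKey()).set("total", kv.getValue());
       }
     }))
     .apply("WriteResult", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.result_table")
         // Assumes the destination table already exists with a matching schema.
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}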

The total pipeline execution time is between 10 and 18 minutes depending on how many workers I assign, whether I use multi-core machines, etc., and I can't get it below that. I haven't determined the exact threshold, but beyond it the difference between, say, 20 single-core workers and 10 quad-core workers is no longer reliably measurable.

So my question is how to investigate this further, figure out which step takes the most time, and improve it. I assume Dataflow itself takes care of scaling the individual steps and balancing work among them. But it would be interesting, for example, to see when the first message is received and the last message is sent for each step, and maybe the throughput over time per step. Is this available somewhere, or do I have to start instrumenting and logging this myself? With that information I would then start optimizing individual steps and maybe override Dataflow's scaling.
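In case it helps frame the question: if I do end up instrumenting this myself, I would expect to do it with something like Beam's Metrics API inside a DoFn. This is only a sketch; the counter names and the step it instruments are made up:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class InstrumentedExtractFn extends DoFn<TableRow, TableRow> {
  // Counters and distributions show up per step in the Dataflow monitoring UI and
  // can be queried from the PipelineResult after the run.
  private final Counter elements = Metrics.counter(InstrumentedExtractFn.class, "elements-processed");
  private final Distribution latencyMs = Metrics.distribution(InstrumentedExtractFn.class, "element-latency-ms");

  @ProcessElement
  public void processElement(ProcessContext c) {
    long start = System.currentTimeMillis();
    TableRow row = c.element();
    // ... the actual JSON extract would go here ...
    c.output(row);
    elements.inc();
    latencyMs.update(System.currentTimeMillis() - start);
  }
}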

Is this the right approach, or are there better approaches available?

** My target is to get this down to 2 minutes.

Answer 1:

Dataflow first runs a BigQuery export job to copy your data to GCS before reading and processing it. Similarly, Dataflow writes its output to GCS and then runs a BigQuery import job to load the new table.

If you look at the Job Messages and/or the Logs you should be able to see when the export job finishes (and your code begins to read from the exported files) and when the import job starts (and your code has finished writing all output data). If the majority of the 10-18 minutes was spent in BigQuery import/export jobs, then tuning the pipeline won't have a significant impact on performance.
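As a side note, if you add Metrics counters to your DoFns (as sketched in the question), you can also read them back programmatically once the job finishes, which makes it easier to compare runs. A minimal sketch, assuming a blocking run; getAttempted() is used because committed values aren't reported by every runner, and the helper name is made up:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class MetricsDump {
  // Waits for the job to finish and prints every counter it reported.
  public static void dumpCounters(PipelineResult result) {
    result.waitUntilFinish();
    MetricQueryResults metrics = result.metrics().queryMetrics(MetricsFilter.builder().build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }
  }
}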

Another common issue is having one or two keys that are particularly hot (e.g., containing most of the data). From your description it sounds like this isn't the case, since you mention that each key has only about 100 elements.
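For completeness, if it did turn out to be a hot-key problem, one common mitigation in Beam is to add fanout to the combine so a hot key is pre-aggregated on several workers before the final merge. A minimal sketch; the fanout value of 16 is arbitrary and the sum stands in for whatever aggregation you actually run:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class HotKeyFanout {
  // Pre-aggregates each hot key across multiple workers before the final per-key merge.
  public static PCollection<KV<String, Double>> sumWithFanout(PCollection<KV<String, Double>> input) {
    // The fanout value is illustrative only; tune it to the observed skew.
    return input.apply("AggregatePerKeyWithFanout",
        Sum.<String>doublesPerKey().withHotKeyFanout(16));
  }
}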