I'm trying to figure out how Dataflow scales certain operations and how to get the best performance out of them. To start with, I created a simple flow that reads data from BigQuery (~25M rows, 30GB total), does a JSON extract, a simple group by key, then one aggregation per group (~100 elements each), then another transform per key, and writes the result into a new table (~500k rows, 25GB total).
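Roughly, the pipeline looks like the sketch below (assuming the Apache Beam Java SDK; the table names, the `key`/`payload` columns and the aggregation logic are placeholders for my actual code):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class FlowSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadSource", BigQueryIO.readTableRows().from("my_project:my_dataset.source_table"))
     .apply("ExtractJson", ParDo.of(new DoFn<TableRow, KV<String, String>>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         TableRow row = c.element();
         // placeholder: the real code parses a JSON column and derives the key
         c.output(KV.of((String) row.get("key"), (String) row.get("payload")));
       }
     }))
     .apply("GroupByKey", GroupByKey.<String, String>create())
     .apply("Aggregate", ParDo.of(new DoFn<KV<String, Iterable<String>>, TableRow>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // placeholder: the real code aggregates the ~100 values per key
         long count = 0;
         for (String ignored : c.element().getValue()) {
           count++;
         }
         c.output(new TableRow().set("key", c.element().getKey()).set("value", count));
       }
     }))
     .apply("WriteResult", BigQueryIO.writeTableRows()
         .to("my_project:my_dataset.result_table")
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```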
The total pipeline execution time is between 10 and 18 minutes, depending on how many workers I assign, whether I use multi-core machines, etc. I can't get it below this. I have not determined the exact threshold, but basically the difference between 20 single-core machines and 10 quad-core machines is no longer reliably measurable.
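For reference, this is how I pin the worker pool between runs (a sketch assuming the Dataflow runner options of the Beam Java SDK; the machine type and worker count are just the two configurations I compared):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerConfig {
  // Builds Dataflow options with a fixed worker pool instead of autoscaling.
  static DataflowPipelineOptions fixedWorkers(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setNumWorkers(20);                      // e.g. 20 single-core workers ...
    options.setWorkerMachineType("n1-standard-1");  // ... vs. 10 x "n1-standard-4"
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE); // pin the pool size
    return options;
  }
}
```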
So my question is how to investigate this further, figure out which step takes the most time, and work out how I can improve it. I assume Dataflow itself takes care of scaling the individual steps and balancing work among them. But it would be interesting, for example, to see the first message received and the last message sent per step, and maybe the throughput over time per step. Is this available somewhere, or do I have to start instrumenting and logging it myself? With that information I would then start optimizing individual steps and maybe overriding Dataflow's scaling.
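If instrumenting it myself is the way to go, I was thinking of something along these lines inside each DoFn (a sketch assuming the Beam Java SDK's Metrics API; the class and metric names are made up):

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical instrumented version of the JSON-extract step: records how many
// elements it processed and the per-element processing latency, so throughput
// and hot steps can be compared after a run.
class InstrumentedExtractFn extends DoFn<String, String> {
  private final Counter elements =
      Metrics.counter(InstrumentedExtractFn.class, "elements-processed");
  private final Distribution latencyMs =
      Metrics.distribution(InstrumentedExtractFn.class, "element-latency-ms");

  @ProcessElement
  public void processElement(ProcessContext c) {
    long start = System.currentTimeMillis();
    // placeholder: the real JSON extraction would happen here
    c.output(c.element());
    elements.inc();
    latencyMs.update(System.currentTimeMillis() - start);
  }
}
```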
Is this the right approach, or are there better approaches available?
** My target is to get this down to 2 minutes.