Performance issues on Dataflow batch loads using A

Posted 2019-03-03 07:21

Question:

I was benchmarking Dataflow batch loads and found them much slower than the same loads run with the BigQuery command-line tool.

The file was around 20 MB, with millions of records. I tried different machine types and got the best performance on n1-highmem-4, where loading the target BQ table took approximately 8 minutes.

When I ran the same table load with the BQ command-line utility, it took barely 2 minutes to process and load the same volume of data. Any insights into why the Dataflow job performs so poorly? How can I improve its performance to make it comparable to the BQ command-line utility?

Answer 1:

Most likely, a few minutes of that time are being spent starting up and shutting down VMs. If what you're doing can be done directly with the BQ CLI, then using Dataflow for it is likely overkill. However, you can update your question with more details (e.g. your code and the Dataflow job ID); maybe there's something else inefficient going on.
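
To make the fixed-overhead point concrete, here is a rough back-of-the-envelope sketch in Python. The 8-minute and 2-minute totals come from the question; the 5-minute figure for VM startup and shutdown is purely an assumed placeholder, not a measurement, and `overhead_share` is a hypothetical helper name. Replace the assumption with timings taken from your own job.

```python
def overhead_share(total_seconds: float, fixed_overhead_seconds: float) -> float:
    """Return the fraction of wall-clock time taken by fixed overhead."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= fixed_overhead_seconds <= total_seconds:
        raise ValueError("overhead must be between 0 and total_seconds")
    return fixed_overhead_seconds / total_seconds


# Totals reported in the question.
dataflow_total_seconds = 8 * 60
bq_total_seconds = 2 * 60

# ASSUMPTION: hypothetical time spent starting and stopping worker VMs.
# This number is illustrative only; measure it on your own job.
assumed_overhead_seconds = 5 * 60

share = overhead_share(dataflow_total_seconds, assumed_overhead_seconds)
remaining_seconds = dataflow_total_seconds - assumed_overhead_seconds

print(f"Assumed overhead share of the Dataflow run: {share:.1%}")
print(f"Time left for actual processing: {remaining_seconds} s")
print(f"BQ CLI total for comparison: {bq_total_seconds} s")
```

Under that assumption, most of the Dataflow run would be fixed cost that does not shrink with a small input, which is why a 20 MB load gains little from a larger machine type. If the measured overhead on your job turns out to be much smaller, the gap probably has another cause, and that is where the job's code becomes relevant.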