Performance issues on Dataflow batch loads using A

Posted 2019-03-03 07:21

太酷不给撩

Question:

I was benchmarking Dataflow batch loads and found them much slower than the same loads run with the BigQuery command-line tool.

The file was around 20 MB and contained millions of records. I tried different machine types and got the best load performance on n1-highmem-4, where loading the target BigQuery table took approximately 8 minutes.

When I loaded the same table by running the BQ command from the command-line utility, it took barely 2 minutes to process and load the same volume of data. Any insights into why the Dataflow job performs so poorly? How can I improve its performance to make it comparable to the BQ command-line utility?

Answer 1:

Most likely, a few minutes are being spent on starting and shutting down VMs. If you're doing something that can directly be done using BQ CLI, then using Dataflow for that purpose is likely overkill. However, you can update your question with more details (e.g. your code and the Dataflow job id) - maybe there's something else inefficient going on.
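To make the overhead argument concrete, here is a rough back-of-the-envelope sketch in Python. The 8-minute and 2-minute totals come from the question; the 5-minute figure for VM startup and shutdown is purely an assumed placeholder, not a measured value, and `split_wall_time` is a hypothetical helper written for this illustration. The real breakdown would have to be read from the job's own logs.

```python
def split_wall_time(total_minutes, fixed_overhead_minutes):
    """Split a job's wall-clock time into actual work and the overhead share.

    Returns (work_minutes, overhead_fraction).
    """
    if not 0 <= fixed_overhead_minutes <= total_minutes:
        raise ValueError("overhead must be between 0 and the total time")
    work_minutes = total_minutes - fixed_overhead_minutes
    overhead_fraction = fixed_overhead_minutes / total_minutes
    return work_minutes, overhead_fraction


# Totals reported in the question.
DATAFLOW_TOTAL_MIN = 8.0
BQ_CLI_TOTAL_MIN = 2.0

# ASSUMPTION: placeholder for VM startup + shutdown time, not a measurement.
ASSUMED_OVERHEAD_MIN = 5.0

work, share = split_wall_time(DATAFLOW_TOTAL_MIN, ASSUMED_OVERHEAD_MIN)
print(f"Dataflow time spent on actual work: {work:.1f} min")
print(f"Share of wall time that is fixed overhead: {share:.0%}")
print(f"BQ CLI total for comparison: {BQ_CLI_TOTAL_MIN:.1f} min")
```

If the fixed overhead really is of that order, most of the gap comes from VM lifecycle rather than from the load itself, and changing the machine type will not help much for an input this small.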