Google Dataflow write to BigQuery table performance

Posted 2019-03-01 06:13

I compared the performance of processing data and writing the output to BigQuery tables versus files, and the difference is significant:

Input: 1.5M records from about 600 files.
Transform: construct/convert a few fields in each record, build a key, and emit key/value pairs; eventually the records for each key go to one target, either a file or a table.

It took 7 minutes to write to 13 files, but over 60 minutes to write to 13 BigQuery tables.

I'm trying to understand whether this is the expected outcome or whether I did something wrong. What factors should be considered when writing to BigQuery tables?

Please help; this could be a show stopper for what I'm trying to do.
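
For context, here is a minimal sketch of the pipeline shape described above, assuming the Beam Python SDK (the actual job may well be in Java). All names here are hypothetical placeholders, not the real pipeline:

  # Minimal sketch, assuming a Beam Python SDK version that supports
  # dynamic table destinations. "my-project", "my-bucket", "my_dataset",
  # the 'category' field, and the schema are all hypothetical.
  import json

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions


  def parse_record(line):
      # Hypothetical convert step: parse a line and keep a couple of fields,
      # including the field used to route the record to its target.
      record = json.loads(line)
      return {'category': record['category'], 'value': record.get('value')}


  def table_for(row):
      # Route each row to its own table, e.g. my_dataset.out_<category>.
      return 'my-project:my_dataset.out_%s' % row['category']


  options = PipelineOptions()  # e.g. --runner=DataflowRunner --project=my-project

  with beam.Pipeline(options=options) as p:
      rows = (
          p
          | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')
          | 'Convert' >> beam.Map(parse_record)
      )

      # Batch writes to BigQuery stage files on GCS and run load jobs
      # (FILE_LOADS), which is part of why this path can be much slower
      # than simply writing the same records to files.
      _ = rows | 'WriteBQ' >> beam.io.WriteToBigQuery(
          table=table_for,
          schema='category:STRING,value:FLOAT',
          create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
          method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
          custom_gcs_temp_location='gs://my-bucket/tmp',
      )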

1 Answer

倾城 Initia · 2019-03-01 06:31

For batch jobs, Dataflow imports data into BigQuery by writing it to GCS and then running BigQuery load jobs to import that data. If you want to know how long those BigQuery jobs take, I think you can look at the BigQuery jobs run in your project.

You can try the following commands to get information about your BigQuery import jobs.

  bq ls -j <PROJECT ID>:

The above command should show you a list of jobs along with details such as their duration. (Note the colon at the end of the project ID; I believe it is required.)

You can then try

  bq show -j <JOB ID>

to get additional information about a particular job.

Note that you must be an owner of the project in order to see jobs run by other users. This applies to the BigQuery jobs run by Dataflow, because Dataflow runs them using a service account.
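
If you would rather inspect those jobs programmatically than through the bq CLI, here is a small sketch using the google-cloud-bigquery Python client; the project ID is a placeholder, and as noted above, seeing jobs started by other identities (such as the Dataflow service account) requires sufficient permissions on the project:

  # Sketch: list recent BigQuery load jobs and their durations with the
  # google-cloud-bigquery client library. "my-project" is a placeholder.
  from google.cloud import bigquery

  client = bigquery.Client(project='my-project')

  # all_users=True includes jobs started by other identities, such as the
  # Dataflow service account, provided your permissions allow it.
  for job in client.list_jobs(max_results=50, all_users=True):
      if job.job_type != 'load':
          continue
      duration = (job.ended - job.created).total_seconds() if job.ended else None
      print(job.job_id, job.state, duration, getattr(job, 'destination', None))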
