Set maximumBillingTier when reading from BigQuery

Posted 2020-05-08 01:01

Question:

I'm running a GCP Dataflow job that reads data from BigQuery as a query result. I'm using google-cloud-dataflow-java-sdk-all version 1.9.0. The code fragment that sets up the pipeline looks like this:

PCollection<TableRow> myRows = pipeline.apply(BigQueryIO.Read
            .fromQuery(query)
            .usingStandardSql()
            .withoutResultFlattening()
            .named("Input " + tableId)
    );

The query is quite complex, which results in the following error message:

Query exceeded resource limits for tier 1. Tier 8 or higher required., error: Query exceeded resource limits for tier 1. Tier 8 or higher required.

I'd like to set maximumBillingTier the way it can be done in the Web UI or with the bq command-line tool. I can't find any way to do so, other than setting a default for the entire project, which unfortunately is not an option.

I tried to set it through the following, without success:

  • DataflowPipelineOptions - neither this interface nor any interface it extends seems to have that setting
  • BigQueryIO.Read.Bound - I would expect it there, right next to usingStandardSql and similar options, but it is not
  • JobConfigurationQuery - this class has all the relevant settings, but it doesn't seem to be usable at all when setting up a pipeline (see the snippet after this list)
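
For reference, here is a minimal sketch of where the setting does exist - on the BigQuery API model class JobConfigurationQuery (from google-api-services-bigquery) - but I can't find any way to hand such an object to BigQueryIO.Read:

import com.google.api.services.bigquery.model.JobConfigurationQuery;

// The low-level BigQuery API model class does expose the setting...
JobConfigurationQuery queryConfig = new JobConfigurationQuery()
        .setQuery(query)
        .setUseLegacySql(false)
        .setFlattenResults(false)
        .setMaximumBillingTier(8);
// ...but BigQueryIO.Read in SDK 1.9.0 has no method that accepts this object
// (nor a standalone maximumBillingTier option).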

Is there any way to pass this setting from within Dataflow job?

Answer 1:

Maybe a Googler will correct me, but it looks like you are right. I can't see this parameter exposed either. I checked both the Dataflow and the Beam APIs.

Under the hood, Dataflow is using JobConfigurationQuery from the BigQuery API, but it simply doesn't expose that parameter through its own API.

One workaround I see is to first run your complex query using the BigQuery API directly, before dropping into your pipeline. That way you can set the maximum billing tier through the JobConfigurationQuery class, and write the results of that query to an intermediate table in BigQuery.

Then, in your pipeline, simply read the table that was created from the complex query, as sketched below.
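
Here is a rough sketch of that workaround, using the low-level BigQuery Java client (google-api-services-bigquery) alongside SDK 1.9.0. The project, dataset and table names and the tier value 8 are placeholders; query, pipeline and tableId are the same variables as in the question; job polling and error handling are omitted:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.*;

// Step 1: run the expensive query directly against the BigQuery API,
// where maximumBillingTier can be set, and materialize the result.
String projectId = "my-project-id";  // placeholder
GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(BigqueryScopes.all());
Bigquery bigquery = new Bigquery.Builder(
        GoogleNetHttpTransport.newTrustedTransport(),
        JacksonFactory.getDefaultInstance(),
        credential)
        .setApplicationName("max-billing-tier-workaround")
        .build();

TableReference destination = new TableReference()
        .setProjectId(projectId)
        .setDatasetId("my_dataset")          // placeholder
        .setTableId("complex_query_result"); // placeholder

JobConfigurationQuery queryConfig = new JobConfigurationQuery()
        .setQuery(query)
        .setUseLegacySql(false)
        .setFlattenResults(false)
        .setMaximumBillingTier(8)            // the setting you can't reach from Dataflow
        .setDestinationTable(destination)
        .setWriteDisposition("WRITE_TRUNCATE")
        .setAllowLargeResults(true);

Job job = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));
bigquery.jobs().insert(projectId, job).execute();
// Wait for the job to reach the DONE state before launching the pipeline.

// Step 2: in the pipeline, read the materialized table instead of the query.
PCollection<TableRow> myRows = pipeline.apply(BigQueryIO.Read
        .named("Input " + tableId)
        .from(destination));

The intermediate table costs some extra storage, but it keeps the billing-tier control in the BigQuery API call, which as far as I can tell is the only place the setting is currently accepted.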