Google Cloud Dataflow BigQueryIO.Write fails with unknown error (HTTP code 500)

Published 2019-05-22 14:13

Question:

Has anyone run into the same problem as me, where Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)?

I use Dataflow to process data for April, May, and June. With the same code, processing the April data (400 MB) and writing it to BigQuery succeeds, but processing the May (60 MB) or June (90 MB) data fails.

  • The data format for April, May, and June is the same.
  • If I change the writer from BigQuery to TextIO, the job succeeds, so I think the data format is fine.
  • The Log Dashboard shows no error logs at all.
  • The system only reports the same unknown error.

The code I wrote is here: http://pastie.org/10907947

Error Message after "Executing BigQuery import job":

Workflow failed. Causes: 
(cc846): S01:Read Files/Read+Window.Into()+AnonymousParDo+BigQueryIO.Write/DataflowPipelineRunner.BatchBigQueryIOWrite/DataflowPipelineRunner.BatchBigQueryIONativeWrite failed., 
(e19a27451b49ae8d): BigQuery import job "dataflow_job_631261" failed., (e19a745a666): BigQuery creation of import job for table "hi_event_m6" in dataset "TESTSET" in project "lib-ro-123" failed., 
(e19a2749ae3f): BigQuery execution failed., 
(e19a2745a618): Error: Message: An internal error occurred and the request could not be completed. HTTP Code: 500

Answer 1:

Sorry for the frustration. It looks like you are hitting a limit on the number of files being written to BigQuery. This is a known issue that we're in the process of fixing.

In the meantime, you can work around this issue by either decreasing the number of input files or resharding the data (do a GroupByKey and then ungroup the data -- semantically it's a no-op, but it forces the data to be materialized so that the parallelism of the write operation isn't constrained by the parallelism of the read).
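To make that reshard workaround concrete, here is a minimal sketch of such a transform. It is written against the Apache Beam 2.x Java SDK for readability (the 1.x Dataflow SDK uses different package names but the same pattern), and the Reshard class and numShards parameter are illustrative names, not anything from the original pipeline:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Semantically a no-op: key each element randomly, group, then ungroup.
// The GroupByKey forces the data to be materialized, so the write's
// parallelism is no longer tied to the parallelism of the read.
public class Reshard<T> extends PTransform<PCollection<T>, PCollection<T>> {

  private final int numShards;  // illustrative parameter: how many random keys to spread over

  public Reshard(int numShards) {
    this.numShards = numShards;
  }

  @Override
  public PCollection<T> expand(PCollection<T> input) {
    final int shards = numShards;
    return input
        // Tag each element with a random key in [0, shards).
        .apply("AssignRandomKey",
            WithKeys.<Integer, T>of(x -> ThreadLocalRandom.current().nextInt(shards))
                .withKeyType(TypeDescriptors.integers()))
        // Grouping materializes the data at this point.
        .apply("Group", GroupByKey.<Integer, T>create())
        // Ungroup: emit every element again and drop the temporary key.
        .apply("Ungroup", ParDo.of(new DoFn<KV<Integer, Iterable<T>>, T>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (T value : c.element().getValue()) {
              c.output(value);
            }
          }
        }))
        // Coder inference can fail for the generic output type, so reuse the input's coder.
        .setCoder(input.getCoder());
  }
}

You would apply it to the rows just before the BigQueryIO write, e.g. rows.apply(new Reshard<TableRow>(100)). Newer Beam releases also ship Reshuffle.viaRandomKey(), which packages the same idea as a built-in transform.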



Answer 2:

Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment by passing: --experiments=enable_custom_bigquery_sink
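For reference, here is a minimal sketch of a driver class (the RunWithCustomSink name is mine, and I am assuming the 1.x SDK's com.google.cloud.dataflow.sdk packages) showing that the flag is just another command-line pipeline option forwarded through PipelineOptionsFactory:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class RunWithCustomSink {
  public static void main(String[] args) {
    // Launch with, for example:
    //   --runner=DataflowPipelineRunner --project=<your-project> \
    //   --stagingLocation=gs://<your-bucket>/staging \
    //   --experiments=enable_custom_bigquery_sink
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    Pipeline p = Pipeline.create(options);
    // ... build the read / transform / BigQueryIO.Write steps here ...
    p.run();
  }
}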

In Dataflow SDK for Java 2.x, this behavior is the default and no experiments are necessary.

Note that in both versions, temporary files in GCS may be left over if your job fails.

Hope that helps!