Has anyone run into the same problem as me, where Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)?
I use Dataflow to process data for April, May, and June. The same code processed the April data (400 MB) and wrote it to BigQuery successfully, but when I process the May (60 MB) or June (90 MB) data, the job fails.
- The data format is the same for April, May, and June.
- If I change the writer from BigQuery to TextIO, the job succeeds, so I think the data format is fine.
- The log dashboard shows no error logs at all.
- The system reports only the same unknown error.
The code I wrote is here: http://pastie.org/10907947
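For reference, the pipeline shape looks roughly like the sketch below. This is only a reconstruction from the step names in the error trace; the input path, schema, and the parsing logic inside the anonymous ParDo are placeholders, and the real code is in the pastie.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    import java.util.Arrays;

    public class EventImport {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder schema; the real table has more fields.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("line").setType("STRING")));

        p.apply(TextIO.Read.named("Read Files").from("gs://my-bucket/events/2016-06/*"))  // hypothetical path
         .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
         .apply(ParDo.of(new DoFn<String, TableRow>() {
           @Override
           public void processElement(ProcessContext c) {
             // Parse one input line into a TableRow (the real parsing is in the pastie).
             c.output(new TableRow().set("line", c.element()));
           }
         }))
         .apply(BigQueryIO.Write
             .to("lib-ro-123:TESTSET.hi_event_m6")
             .withSchema(schema));

        p.run();
      }
    }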
Error Message after "Executing BigQuery import job":
Workflow failed. Causes:
(cc846): S01:Read Files/Read+Window.Into()+AnonymousParDo+BigQueryIO.Write/DataflowPipelineRunner.BatchBigQueryIOWrite/DataflowPipelineRunner.BatchBigQueryIONativeWrite failed.,
(e19a27451b49ae8d): BigQuery import job "dataflow_job_631261" failed., (e19a745a666): BigQuery creation of import job for table "hi_event_m6" in dataset "TESTSET" in project "lib-ro-123" failed.,
(e19a2749ae3f): BigQuery execution failed.,
(e19a2745a618): Error: Message: An internal error occurred and the request could not be completed. HTTP Code: 500
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment:
--experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is the default and no experiment is necessary.
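In case it's useful, here is a minimal sketch of both ways to enable the experiment in SDK 1.x: passing the flag on the command line, or setting it programmatically. I'm assuming here that the experiments list is exposed via DataflowPipelineDebugOptions; if not, the command-line flag alone is enough.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineDebugOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    import java.util.Arrays;

    public class EnableCustomSink {
      public static void main(String[] args) {
        // Passing --experiments=enable_custom_bigquery_sink on the command line is picked up here.
        DataflowPipelineDebugOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineDebugOptions.class);

        // Alternatively, set it programmatically before creating the pipeline.
        options.setExperiments(Arrays.asList("enable_custom_bigquery_sink"));

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline as usual, including the BigQueryIO.Write step ...
        p.run();
      }
    }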
Note that in both versions, temporary files in GCS may be left over if your job fails.
Hope that helps!
Sorry for the frustration. It looks like you are hitting a limit on the number of files being written to BigQuery. This is a known issue that we're in the process of fixing.
In the meantime, you can work around this issue by either decreasing the number of input files or resharding the data (do a GroupByKey and then ungroup the data -- semantically it's a no-op, but it forces the data to be materialized so that the parallelism of the write operation isn't constrained by the parallelism of the read).
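Here is a rough sketch of that reshard as a reusable transform. It assumes the elements are TableRows and uses a hypothetical shard count of 100; adjust both to your pipeline.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    import java.util.concurrent.ThreadLocalRandom;

    /** Semantically a no-op: assigns random keys, groups, then ungroups, forcing materialization. */
    public class Reshard extends PTransform<PCollection<TableRow>, PCollection<TableRow>> {
      private static final int NUM_SHARDS = 100;  // hypothetical; tune to the write parallelism you want

      @Override
      public PCollection<TableRow> apply(PCollection<TableRow> input) {
        return input
            .apply(ParDo.named("AssignRandomKeys").of(new DoFn<TableRow, KV<Integer, TableRow>>() {
              @Override
              public void processElement(ProcessContext c) {
                // Attach a random shard key to each element.
                c.output(KV.of(ThreadLocalRandom.current().nextInt(NUM_SHARDS), c.element()));
              }
            }))
            .apply(GroupByKey.<Integer, TableRow>create())
            .apply(ParDo.named("Ungroup").of(new DoFn<KV<Integer, Iterable<TableRow>>, TableRow>() {
              @Override
              public void processElement(ProcessContext c) {
                // Emit the grouped values again, dropping the keys.
                for (TableRow row : c.element().getValue()) {
                  c.output(row);
                }
              }
            }));
      }
    }

You would apply it right before the write, e.g. rows.apply(new Reshard()).apply(BigQueryIO.Write.to(...)).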