BigQuery streaming 'insertAll' performance

Posted 2020-03-26 06:52

Question:

We're streaming a high volume of data server-side into BigQuery using the google-api-php-client library. The streaming works fine apart from the performance.

Our load testing is giving us an average time of 1000ms (1 sec) to stream one row into BigQuery. We can't have the client waiting for more than 200ms. We've tested with smaller payloads and the time remains the same. Async calls on the client side are not an option for us.

The 'bottleneck' line of code is:

$service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);

Having looked under the hood of the library, we can see that the call to insert the row is simply a cURL request (Curl.php in the library).

Is there any way to modify insertAll() to make it faster? We don't care about the result, so fire-and-forget would work for us. We've tried setting CURLOPT_CONNECTTIMEOUT_MS and CURLOPT_TIMEOUT_MS on the underlying cURL request, but it doesn't work.
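For reference, this is roughly the kind of thing we tried. The URL and payload here are placeholders (the real request is built inside Curl.php and carries an OAuth access token, omitted below), so take it as a sketch of the attempt rather than the library's actual code:

<?php
// Sketch of the timeout attempt (placeholder URL/payload, auth omitted).
$payload = json_encode(array(
    'kind' => 'bigquery#tableDataInsertAllRequest',
    'rows' => array(array('json' => array('user_id' => 123, 'event' => 'signup'))),
));

$ch = curl_init('https://www.googleapis.com/bigquery/v2/projects/PROJECT_NUMBER/datasets/DATA_SET/tables/TABLE/insertAll');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 100); // cap time spent connecting
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200);        // cap total request time
curl_exec($ch);  // still blocks until the response or the timeout, so it isn't fire-and-forget
curl_close($ch);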

Answer 1:

Reading all your comments and side notes: the approach you've chosen does not scale and won't scale. You need to rethink the approach with async processes.

Processing IO-bound or CPU-bound tasks in the background is now common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.

Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That's exactly what Beanstalkd provides.

Beanstalkd lets you organize jobs into tubes, with each tube corresponding to a job type.

You need an API/producer that can put jobs on a tube, say a JSON representation of the row. This was a killer feature for our use case. So we have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
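Here is a minimal producer sketch, assuming the Pheanstalk PHP client and a tube named 'bigquery-rows' (the client, host and tube name are just examples, not something the original setup prescribes):

<?php
// Producer sketch (illustrative): the web request only serializes the row,
// puts it on a Beanstalkd tube and returns. Host, tube name and the
// Pheanstalk version/API are assumptions; adjust to your setup.
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');  // Beanstalkd server

// The column values you would otherwise have passed to insertAll().
$row = array('user_id' => 123, 'event' => 'signup');

// Putting a job on a tube takes only a few milliseconds, so the HTTP
// response can go back to the client well under the 200ms budget.
$pheanstalk
    ->useTube('bigquery-rows')
    ->put(json_encode($row));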

On the other side, you now have a bunch of jobs on some tubes. You need an agent: an agent/consumer can reserve a job.

It also helps you with job management and retries: when a job is successfully processed, the consumer can delete it from the tube. In case of failure, the consumer can bury the job; a buried job is not pushed back onto the tube, but remains available for further inspection.

A consumer can also release a job; Beanstalkd will then push it back onto the tube and make it available to another client.
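A minimal worker sketch under the same assumptions (Pheanstalk client, the hypothetical 'bigquery-rows' tube, and the $service / request objects set up as in the question's code) could look like this:

<?php
// Worker sketch (illustrative): reserves jobs from the tube, streams them
// into BigQuery with the google-api-php-client, then deletes, buries or
// releases them depending on the outcome.
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

// $service is an authenticated Google_Service_Bigquery instance, configured
// the same way as in the question; PROJECT_NUMBER, DATA_SET and TABLE as before.

while (true) {
    // Blocks until a job is available on the tube.
    $job = $pheanstalk->watch('bigquery-rows')->ignore('default')->reserve();

    try {
        $rows = new Google_Service_Bigquery_TableDataInsertAllRequestRows();
        $rows->setJson(json_decode($job->getData(), true));

        $request = new Google_Service_Bigquery_TableDataInsertAllRequest();
        $request->setRows(array($rows));

        // The slow insertAll() call now runs in the background,
        // outside the user-facing request.
        $service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);

        $pheanstalk->delete($job);   // success: remove the job from the tube
    } catch (Exception $e) {
        $pheanstalk->bury($job);     // failure: keep the job for inspection
        // or: $pheanstalk->release($job); to hand it back to another worker
    }
}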

Beanstalkd clients are available for most common languages, and a web interface can be useful for debugging.