Google BigQuery queries are slow

2019-07-16 03:04发布

I am using Google BigQuery and I am executing some simple queries from PHP. (e.g. SELECT * from emails WHERE email='mail@test.com') I am just checking if the email exists in the table.

The table "emails" is empty for now. But still the PHP script takes around 4 minutes to check 175 emails on an empty table .. As I wish in future the table will be filled and will have 500 000 mails then I guess the request time will be longer.

Is that normal ? Or are there any ideas/solutions to improve the checking time ?

(P.S. : The table "emails" contains only 8 columns, all are string type)

Thank you !

2条回答
Melony?
2楼-- · 2019-07-16 03:44

Here is a sample code in PHP to do streaming inserts, using the official https://github.com/google/google-api-php-client:

/**
 * 
 * @param type $client
 * @param type $project_id
 * @param type $dataset_id
 * @param type $rows
 * @return boolean
 * @throws Google_Service_Exception
 */
public function BQ_Tabledata_InsertAll($client, $project_id, $dataset_id, $rows) {
    $bq = new Google_Service_Bigquery($client);
    $request = new Google_Service_Bigquery_TableDataInsertAllRequest();
    $request->setRows($rows);
    try {
        $resp = new Google_Service_Bigquery_TableDataInsertAllResponse();
        $resp = $bq->tabledata->insertAll($project_id, $dataset_id, static::tableId(), $request);
        $errors = new Google_Service_Bigquery_TableDataInsertAllResponseInsertErrors();
        $errors = @$resp->getInsertErrors();
        if (!empty($errors)) {
            $error_msg = '';
            if (is_array($errors)) {
                $line = 0;
                foreach ($errors as $eP) {
                    $arr = $eP->getErrors();
                    if (is_array($arr)) {
                        foreach ($arr as $e) {
                            switch ($e->getReason()) {
                                case "stopped":
                                    break;
                                default:
                                    $error_msg.= sprintf("Error on line %s: %s\r\n", $line, $e->getMessage());
                                    break;
                            }
                        }
                    }
                    $line++;
                }
                $this->setErrorMessage($error_msg);
            } else {
                $this->setErrorMessage($errors);
            }
            //print_r($errors);
            //exit;
            return false;
        }
        return true;
    } catch (Google_Service_Exception $e) {
        $this->setErrors($e->getErrors())->setErrorMessage($e->getMessage());
        throw $e;
    }
}
查看更多
劫难
3楼-- · 2019-07-16 03:49

If you are just checking for existence of a field, consider using SELECT COUNT(*) FROM emails where email='mail@test.com' instead. This will only require reading a single field, and so will cost less and be marginally faster on large tables.

And as Pentium10 suggested, consider using multiple lookups in a single query. You could do this like:

SELECT SUM((IF(email = 'mail1@test.com', 1, 0)) as m1,
       SUM((IF(email = 'mail2@test.com', 1, 0)) as m2,
       SUM((IF(email = 'mail3@test.com', 1, 0)) as m3,
       ...
 FROM emails

You're going to be limited to something like 64k of these in a single query, but it should be very fast to compute since it only requires scan of a single column in one pass.

Alternately,if you wanted the e-mails as one per row, you could do something a little bit fancier like

 SELECT email FROM emails WHERE email IN
 ('mail1@test.com', 'mail2@test.com', 'mail3@test.com'...)
 GROUP BY email

As a further optimization, you could do it as a LEFT JOIN:

SELECT t1.email as email, IF(t2.email is not null, true, false) as found 
FROM [interesting_emails] t1  
LEFT OUTER JOIN [emails] t2 ON t1.email = t2.email

If the interesting_emails had the list of emails you wanted to check, like

mail1@test.com
mail2@test.com
mail3@test.com

If the emails table contained only mail1@ and maiil2@, then you'd get back as results:

email            found
______________   _____
mail1@test.com   true
mail2@test.com   false
mail3@test.com   true

The advantage of doing it this way is that it will scale up to the billions of e-mails if needed (when the number gets large you might consider using a JOIN EACH instead of a JOIN).

查看更多
登录 后发表回答