I am trying to optimize my querying in BigQuery for cost, and I have been trying out Clustered tables. For reference: BigQuery - querying only a subset of keys in a table with key value schema
Clustering the table by a single column is successfully reducing my query size. However, using multiple columns (example shown in: https://cloud.google.com/bigquery/docs/querying-clustered-tables#sample_table_used_in_the_examples) is not leading to any reduction in query size.
To use the example given in the documentation,
SELECT
SUM(totalSale)
FROM
mydataset.ClusteredSalesData
WHERE
customer_id = 10000
AND product_id LIKE 'gcp_analytics%'
This queries the entire data set (say, 100GB) if there was no clustering on the table, reduces to about 10GB (seen after actual run, not at the validator) when clustered only by customer_id, but does not change at all when clustered by both customer_id and product_id (even after actual run of the query).
I have tried changing the order of the clustering, order of the WHERE clauses, etc. Nothing seems to change anything.
Is this expected behavior? Bug on BigQuery? Or am I doing something wrong?
UPDATE: Thanks to @Pentium10 for pointing me to: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
To use the examples from the blogpost, among the following two queries,
Q1:
SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
--AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
Q2:
SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
I would have expected Q2 to be cheaper since clustering is by (wiki, title), but that does not seem to be the case.
I tested the next queries based on this post, suggested by Pentium10:
180.19GB processed (according to the
validator
). 10.3GB processed running the query.86.1GB processed (according to the
validator
). 18.4GB processed running the query.180.19GB processed (according to the
validator
). 113.85GB processed running the query.Everything looks coherent since, as Mr. Hoffa said, "order matters" for clustered tables ('wiki' saves more than 'title').
Is true that the validator is still not properly working but
clustered tables
are still on beta, so we can expect an improvement in the future.In your Query 1 (Q1) - estimated cost is $0.43 for processing
86.1 GB
- actual bytes billed -18.4 GB
For Q2 - estimated cost $0.90 for processing
180 GB
- actual bytes billed -10.3 GB
Sounds like cheaper to me :o)
But in reality, save is even greater than just
down to 10.3 GB from 18.4 GB
- it is actuallydown to 10.3 GB from 38.6 GB
. It is because in Q1 there is notitle
column involved at all, while in Q2 it is!So, to compare apples to apples - you can add
title
as in below - you will see that estimated cost will be180 GB
- while actual bytes billed -38.6 GB
Q3:
Note/reminder: when you query clustered table - estimation is shown based on partition only (as of now - my understanding that later this will be fixed/improved)