I am trying to optimize my querying in BigQuery for cost, and I have been trying out Clustered tables.
For reference: BigQuery - querying only a subset of keys in a table with key value schema
Clustering the table by a single column is successfully reducing my query size. However, using multiple columns (example shown in: https://cloud.google.com/bigquery/docs/querying-clustered-tables#sample_table_used_in_the_examples) is not leading to any reduction in query size.
To use the example given in the documentation,
SELECT
SUM(totalSale)
FROM
mydataset.ClusteredSalesData
WHERE
customer_id = 10000
AND product_id LIKE 'gcp_analytics%'
This queries the entire data set (say, 100GB) if there was no clustering on the table, reduces to about 10GB (seen after actual run, not at the validator) when clustered only by customer_id, but does not change at all when clustered by both customer_id and product_id (even after actual run of the query).
I have tried changing the order of the clustering, order of the WHERE clauses, etc. Nothing seems to change anything.
Is this expected behavior? Bug on BigQuery? Or am I doing something wrong?
UPDATE:
Thanks to @Pentium10 for pointing me to: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
To use the examples from the blogpost, among the following two queries,
Q1:
SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
--AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
Q2:
SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
I would have expected Q2 to be cheaper since clustering is by (wiki, title), but that does not seem to be the case.
In your Query 1 (Q1) - estimated cost is $0.43 for processing 86.1 GB
- actual bytes billed - 18.4 GB
For Q2 - estimated cost $0.90 for processing 180 GB
- actual bytes billed - 10.3 GB
Sounds like cheaper to me :o)
But in reality, save is even greater than just down to 10.3 GB from 18.4 GB
- it is actually down to 10.3 GB from 38.6 GB
. It is because in Q1 there is no title
column involved at all, while in Q2 it is!
So, to compare apples to apples - you can add title
as in below - you will see that estimated cost will be 180 GB
- while actual bytes billed - 38.6 GB
Q3:
SELECT wiki, title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
-- AND title = 'Barcelona'
GROUP BY wiki, title
Note/reminder: when you query clustered table - estimation is shown based on partition only (as of now - my understanding that later this will be fixed/improved)
I tested the next queries based on this post, suggested by Pentium10:
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
180.19GB processed (according to the validator
). 10.3GB processed running the query.
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
--AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
86.1GB processed (according to the validator
).
18.4GB processed running the query.
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
-- AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
180.19GB processed (according to the validator
).
113.85GB processed running the query.
Everything looks coherent since, as Mr. Hoffa said, "order matters" for clustered tables ('wiki' saves more than 'title').
Is true that the validator is still not properly working but clustered tables
are still on beta, so we can expect an improvement in the future.