BigQuery - Clustered tables not reducing query siz

2019-07-18 06:02发布

问题:

I am trying to optimize my querying in BigQuery for cost, and I have been trying out Clustered tables. For reference: BigQuery - querying only a subset of keys in a table with key value schema

Clustering the table by a single column is successfully reducing my query size. However, using multiple columns (example shown in: https://cloud.google.com/bigquery/docs/querying-clustered-tables#sample_table_used_in_the_examples) is not leading to any reduction in query size.

To use the example given in the documentation,

SELECT
  SUM(totalSale)
FROM
  mydataset.ClusteredSalesData
WHERE
  customer_id = 10000
  AND product_id LIKE 'gcp_analytics%'

This queries the entire data set (say, 100GB) if there was no clustering on the table, reduces to about 10GB (seen after actual run, not at the validator) when clustered only by customer_id, but does not change at all when clustered by both customer_id and product_id (even after actual run of the query).

I have tried changing the order of the clustering, order of the WHERE clauses, etc. Nothing seems to change anything.

Is this expected behavior? Bug on BigQuery? Or am I doing something wrong?

UPDATE: Thanks to @Pentium10 for pointing me to: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b

To use the examples from the blogpost, among the following two queries,

Q1:

SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
--AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki

Q2:

SELECT wiki, SUM(views) views
FROM fh-bigquery.wikipedia_v3.pageviews_2017
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki

I would have expected Q2 to be cheaper since clustering is by (wiki, title), but that does not seem to be the case.

回答1:

In your Query 1 (Q1) - estimated cost is $0.43 for processing 86.1 GB - actual bytes billed - 18.4 GB

For Q2 - estimated cost $0.90 for processing 180 GB - actual bytes billed - 10.3 GB

Sounds like cheaper to me :o)

But in reality, save is even greater than just down to 10.3 GB from 18.4 GB - it is actually down to 10.3 GB from 38.6 GB. It is because in Q1 there is no title column involved at all, while in Q2 it is!
So, to compare apples to apples - you can add title as in below - you will see that estimated cost will be 180 GB - while actual bytes billed - 38.6 GB

Q3:

SELECT wiki, title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
-- AND title = 'Barcelona'
GROUP BY wiki, title   

Note/reminder: when you query clustered table - estimation is shown based on partition only (as of now - my understanding that later this will be fixed/improved)



回答2:

I tested the next queries based on this post, suggested by Pentium10:

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
AND wiki = 'en' 
AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

180.19GB processed (according to the validator). 10.3GB processed running the query.

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
AND wiki = 'en' 
--AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

86.1GB processed (according to the validator). 18.4GB processed running the query.

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
-- AND wiki = 'en' 
AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

180.19GB processed (according to the validator). 113.85GB processed running the query.

Everything looks coherent since, as Mr. Hoffa said, "order matters" for clustered tables ('wiki' saves more than 'title').

Is true that the validator is still not properly working but clustered tables are still on beta, so we can expect an improvement in the future.