Using subquery for _PARTITIONTIME in bigquery does

2019-07-10 03:31发布

When I run the below query on BQ using standard SQL, it says it will process 76.6TB when running

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >(SELECT * FROM `analytics-dwh.autobidding.activity_list` )
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

The table analytics-dwh.autobidding.activity_list has just one column with a unique list of integers

If I remove the subquery from the above table, the query uses less than 500GB when run

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >TIMESTAMP('2018-12-20')
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

Why does this happen when I use a subquery? Is there a workaround?

2条回答
爷的心禁止访问
2楼-- · 2019-07-10 03:47

From looking at your query I assume you want to select the MIN or MAX date from this:

SELECT * FROM `analytics-dwh.autobidding.activity_list`

And pass it to the 2nd part of your query

This can be done by using BigQuery API inside a code which will pass the value between the 2 steps (For example Python or Javascript)

查看更多
Emotional °昔
3楼-- · 2019-07-10 04:00

Why does this happen with use of subquery?

In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.

You can see more about Limiting partitions queried using pseudo columns

So, in your first query (where you use subquery) - the pruning is not happening (It does not limit use of partitions based on the condition that involves the subquery)

In second query you use _PARTITIONTIME >TIMESTAMP('2018-12-20') so partitions are limited

Bottom line: filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table.

Is there a workaround?

You should split your task to two steps: calculate filter for _PARTITIONTIME using whatever logic you need off of table analytics-dwh.autobidding.activity_list and then use it (instead of subquery) - using any client of your choice

查看更多
登录 后发表回答