I have a table with a column updated_at
, which is a sort key. After running both VACUUM
and ANALYZE
on the table, this is the query plan I get when filtering on updated_at
:
EXPLAIN
SELECT *
FROM my_table
WHERE updated_at > '2018-01-01';
QUERY PLAN
XN Seq Scan on my_table (cost=0.00..0.00 rows=1 width=723)
Filter: (updated_at > '2018-01-01 00:00:00'::timestamp without time zone)
My understanding is that the query execution engine is doing a sequential scan of the table despite the sort key and thus the sort key is not doing anything?
Have a look at the "Analyzing the Query Summary" section in the Redshift docs. It shows how you can use the SVL_QUERY_SUMMARY
view to observe extremely detailed metrics for each query execution.
For the most effective sort key usage you should see the rr_scan
(range restricted scan) column set to t
and the num_rows_pre_filter
should be fairly close to the rows
count. num_rows_pre_filter
is the number of rows scanned off of disk before applying your predicate filter. NB: "Fairly close" in this context will vary depending on your specific data.
SELECT stm,seg,step,TRIM(LEFT(label,30))"label"
,rows_pre_filter,rows,avgtime,bytes,is_rrscan
FROM svl_query_summary
WHERE query = 123456
ORDER BY stm ,seg ,step;
| stm | seg | step | label | rows_pre_filter | rows | avgtime | bytes | is_rrscan |
|-----|-----|------|-------------------------------|-----------------|------|---------|-------|-----------|
| 0 | 0 | 0 | scan tbl=428142 name=my_tbl | 103665 | 6 | 52814 | 1273 | t |
| 0 | 0 | 1 | project | 0 | 6 | 52814 | 0 | f |
| 0 | 0 | 2 | sort tbl=303 | 0 | 6 | 52814 | 1288 | f |
| 1 | 1 | 0 | scan tbl=303 name=Internal | 0 | 6 | 74 | 1288 | f |
| 1 | 1 | 1 | return | 0 | 6 | 74 | 0 | f |
| 1 | 2 | 0 | merge | 0 | 0 | 275 | 0 | f |
| 1 | 2 | 1 | project | 0 | 6 | 275 | 0 | f |
| 1 | 2 | 2 | return | 0 | 6 | 275 | 1387 | f |
Sequential scans are perfectly normal on Amazon Redshift, since it does not use indexes.
The system is smart enough to skip-over blocks that do not contain desired values, since each block (which contains data for only one column) stores the minimum and maximum of each value in the block. Thus, a block with all dates prior to 2018-01-01
would be automatically skipped.
This will not appear in the EXPLAIN plan, since it depends on the actual data stored in each block.
The best thing is to run some tests and see whether it is running quickly, as you would expect. You would want to run SET enable_result_cache_for_session TO OFF
to stop caching from impacting results.
Also, try to avoid situations where the SORT KEY is cast to a different type. In the above example, if the column is a DATE but the query is using it as a TIMESTAMP, then it might not be able to skip over blocks since it has to transform the value after it is read from disk. Therefore, it might work better if there WHERE is using the exact same datatype.