Is my sort key being used?

2019-08-13 00:38发布

I have a table with a column updated_at, which is a sort key. After running both VACUUM and ANALYZE on the table, this is the query plan I get when filtering on updated_at:

EXPLAIN
SELECT *
FROM my_table
WHERE updated_at > '2018-01-01';

QUERY PLAN
XN Seq Scan on my_table  (cost=0.00..0.00 rows=1 width=723)
  Filter: (updated_at > '2018-01-01 00:00:00'::timestamp without time zone)

My understanding is that the query execution engine is doing a sequential scan of the table despite the sort key and thus the sort key is not doing anything?

2条回答
祖国的老花朵
2楼-- · 2019-08-13 00:49

Have a look at the "Analyzing the Query Summary" section in the Redshift docs. It shows how you can use the SVL_QUERY_SUMMARY view to observe extremely detailed metrics for each query execution.

For the most effective sort key usage you should see the rr_scan (range restricted scan) column set to t and the num_rows_pre_filter should be fairly close to the rows count. num_rows_pre_filter is the number of rows scanned off of disk before applying your predicate filter. NB: "Fairly close" in this context will vary depending on your specific data.

SELECT stm,seg,step,TRIM(LEFT(label,30))"label"
      ,rows_pre_filter,rows,avgtime,bytes,is_rrscan 
FROM svl_query_summary 
WHERE query  = 123456
ORDER BY stm ,seg ,step; 

| stm | seg | step |             label             | rows_pre_filter | rows | avgtime | bytes | is_rrscan |
|-----|-----|------|-------------------------------|-----------------|------|---------|-------|-----------|
|   0 |   0 |    0 | scan   tbl=428142 name=my_tbl |          103665 |    6 |   52814 |  1273 | t         |
|   0 |   0 |    1 | project                       |               0 |    6 |   52814 |     0 | f         |
|   0 |   0 |    2 | sort   tbl=303                |               0 |    6 |   52814 |  1288 | f         |
|   1 |   1 |    0 | scan   tbl=303 name=Internal  |               0 |    6 |      74 |  1288 | f         |
|   1 |   1 |    1 | return                        |               0 |    6 |      74 |     0 | f         |
|   1 |   2 |    0 | merge                         |               0 |    0 |     275 |     0 | f         |
|   1 |   2 |    1 | project                       |               0 |    6 |     275 |     0 | f         |
|   1 |   2 |    2 | return                        |               0 |    6 |     275 |  1387 | f         |
查看更多
冷血范
3楼-- · 2019-08-13 00:55

Sequential scans are perfectly normal on Amazon Redshift, since it does not use indexes.

The system is smart enough to skip-over blocks that do not contain desired values, since each block (which contains data for only one column) stores the minimum and maximum of each value in the block. Thus, a block with all dates prior to 2018-01-01 would be automatically skipped.

This will not appear in the EXPLAIN plan, since it depends on the actual data stored in each block.

The best thing is to run some tests and see whether it is running quickly, as you would expect. You would want to run SET enable_result_cache_for_session TO OFF to stop caching from impacting results.

Also, try to avoid situations where the SORT KEY is cast to a different type. In the above example, if the column is a DATE but the query is using it as a TIMESTAMP, then it might not be able to skip over blocks since it has to transform the value after it is read from disk. Therefore, it might work better if there WHERE is using the exact same datatype.

查看更多
登录 后发表回答