Redshift allows designating multiple columns as SORTKEY
columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.
If I create a table with SORTKEY (COL1, COL2)
, does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?
My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id)
, SORTKEY (stamp)
, SORTKEY (type_id,stamp)
, SORTKEY (stamp,type_id)
.
Thanks.
I will say the order for
sort_key
should bethe general rule: lower cardinality put first if same level.
We are also using Redshift and we have about 2 billion records (+20 million every day) and I have to say, the less selective the sort_key is, the more ahead it should be in the sort_key list.
In our case (and please be advised to analyze how you use/query your own data) we used timestamp as first sort_key. The problem with this is, that even within 1 second we record about 200 rows, which results our 1MB blocks contain only a few seconds, and every type of data in that single block. Meaning, even though timestamp is highly selective, after we cannot really filter further as we have all kinds of data in every block.
Recently we have reversed the order of the sort_keys. The first one has about 15 different values, the second has about 30, etc... and timestamp is the last one now, but still, one block is still measured in seconds.
This results, (since we use the first two sort_keys as filters very frequently) the following: Old solution: A year of data, select a month, it drops 91% of the blocks, but after it has to open all of them, even though we want to filter further.
The new solution drops about 14/15 of the blocks in the first step, regardless of the date range, then about 95% of the remaining ones, and timestamp still drops 91% of the remaining ones.
We have tested it thoroughly with two, 800 million records tables, which were the same, except the order of the sort keys. The higher the time-period in the 'where' clause was, the better results we got. It got even more significant in case of joins obviously.
So my suggestion is, know your database and what kind of queries you run frequently, because the most selective column might not be the best first sort_key. Just as Enno Shioji said, it all depends on by what you are filtering.
If you declare
SORTKEY(COL1, COL2)
, all columns will be sorted byCOL1
, thenCOL2
as ifORDER BY (COL1, COL2)
was done.If you are using
SORTKEY
to speed up a JOIN, AFAIU it doesn't matter so long as you use the sameSORTKEY
on the tables that will be joined because what happens is a merge join.If
COL1
is highly selective like yourtype_id
, it means there are only small numbers of rows which has the sametype_id
. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.If
COL1
is not highly selective like yourstamp
(which is a bit weird btw; I would have expected it to be more selective thantype_id
? Anyways..), it means that filtering bystamp
won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter bystamp
but not bytype_id
, it may make sense to do this though.