Our database has many tables which follow an "insert only" schema. Rows are added to the end, the "current" value can then be found by finding the most recently inserted row for each logical key.
Here's an example:
CREATE TABLE [dbo].[SPOTQUOTE](
[ID] [numeric](19, 0) NOT NULL,
[QUOTETYPE] [varchar](255) NOT NULL,
[QUOTED_UTC_TS] [datetime] NOT NULL,
[QUOTED_UTC_MILLIS] [smallint] NOT NULL,
[RECEIVED_UTC_TS] [datetime] NOT NULL,
[RECEIVED_UTC_MILLIS] [smallint] NOT NULL,
[VALUE_ASK] [float] NULL,
[VALUE_BID] [float] NULL,
[FEEDITEM_ID] [numeric](19, 0) NOT NULL,
[SAMPLING] [int] NOT NULL,
CONSTRAINT [SPOTQUOTE_pk1] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
The logical key of this table is "feeditem_id". However, so that we can perform historical queries, instead, rows are only ever inserted into this table at the end, using "ID" as the actual physical key.
Therefore, we know that the max(id) for each distinct feeditem_id is going to be found towards the end of the table, not the beginning.
When querying the table, we want to find the "latest" update for each "feeditem_id", which is the "logical key" for this table.
The following is the query we want:
select feeditem_id, max(id)
from spotquote
group by feeditem_id
having feeditem_id in (827, 815, 806)
so that we have the latest id for each feeditem_id.
Unfortunately, SQL server 2008 generates a sub optimal query plan for this query.
From my understanding of SQL, the fact that this is selecting for the max id, which is the primary clustered key, implies that the optimal query plan for this is to:
- Start at the end of the table
- Walk backwards, keeping track of the max id encountered so far for each feeditem_id
- Stop once an Id for each feeditem_id has been found
I would expect this to be extremely fast.
First question: is there some way I can explicitly tell SQL server to execute the above query plan?
I have tried:
SELECT feeditem_id, max(ID) as latest from SPOTQUOTE with (index(SPOTQUOTE_pk1)) group by feeditem_id
having FEEDITEM_ID in (827, 815, 806)
But, in practise, it seems to execute even more slowly.
I am wondering if the "clustered index scan" is walking the table forwards instead of backwards... Is there a way I can confirm if this is what is happening?
How can I confirm that this clustered index scan works from the back of the table, and how can I convince SQL server to search the clustered index backwards instead?
Update
The issue is indeed that the clustered index scan is not searching backwards when I perform a group by.
In contrast, the following SQL query produces essentially the correct query plan:
select
FEEDITEM_ID, MAX(id) from
(select top 100 * from
SPOTQUOTE
where FEEDITEM_ID in (827,815,806)
order by ID desc) s
group by feeditem_id
I can see it in management studio that "Ordered = True" and "Scan Direction = BACKWARD":
and it executes blindingly fast - 2 milliseconds - and it "almost certainly" works.
I just want it to "stop" once it has found an entry for each feed id rather than the first 100 entries.
It's frustrating that there seems to be no way to tell SQL server to execute this obviously more efficient query.
If I do a normal "group by" with appropriate indexes on feeditem_id and id, it's faster - about 300 ms - but still that's still 100 times slower than the backwards clustered index scan.
SQL Server is unable to produce such a query plan as of 2012. Rewrite the query:
This results in a plan that does a seek into the
spotquote
table per ID that you specify. This is the best we can do. SQL Server is unable to abort an aggregation as soon as all groups you are interested in have at least one value.