Introduction and Background
I had to optimize a simple query (example below). After rewriting it several times I recognized that the estimated row count on the one and same index operation differs depending on the way the query is written.
Originally the query did a clustered index scan, as the table in production contains a binary column the table is quite large (about 100 GB) and the full table scan takes too much time to execute.
Question
Why is the estimated row count different on the same index operation (example will show)? What is the optimizer doing here?
the example database - I am using SQL Server 2008 R2
I tried to create a very simplyfied version of my production tables that shows the behaviour.
-- CREATE THE SAMPLE TABLES
----------------------------
CREATE TABLE dbo.MasterTable(
MasterId smallint NOT NULL,
Name varchar(5) NOT NULL,
CONSTRAINT PK_MasterTable PRIMARY KEY CLUSTERED (MasterId ASC)
) ON [PRIMARY]
GO
CREATE TABLE dbo.DetailTable(
DetailId bigint IDENTITY(1,1) NOT NULL,
MasterId smallint NOT NULL,
Name nvarchar(50) NOT NULL,
CreateDate datetime NOT NULL,
CONSTRAINT PK_DetailTable PRIMARY KEY CLUSTERED (DetailId ASC)
) ON [PRIMARY]
GO
ALTER TABLE dbo.DetailTable
ADD CONSTRAINT FK1
FOREIGN KEY(MasterId) REFERENCES dbo.MasterTable (MasterId)
GO
CREATE NONCLUSTERED INDEX IX_DetailTable
ON dbo.DetailTable( MasterId ASC, Name ASC )
GO
-- INSERT SOME SAMPLE DATA
----------------------------
SET NOCOUNT ON
GO
-- These are some Codes. In our system we always use these codes to search for "types" of data.
INSERT INTO dbo.MasterTable (MasterId, Name)
VALUES (1, 'N1'), (2, 'N2'), (3, 'N3'), (4, 'N4'), (5, 'N5'), (6, 'N6'), (7, 'N7'), (8, 'N8')
GO
-- ADD ROWS TO THE DETAIL TABLE
-- Takes about 1 minute to run
-- Don't care about the logic, it's just to get a distribution similar to production system
----------------------------
declare @x int = 1
DECLARE @MasterID INT
while (@x <= 400000)
begin
SET @MasterID = ABS(CHECKSUM(NEWID())) % 8 + 1
INSERT INTO dbo.DetailTable(MasterId,Name,CreateDate)
VALUES(
CASE
WHEN @MasterID IN (1, 3, 4) AND @x % 20 != 0 THEN 2
WHEN @MasterID IN (5, 6) AND @x % 20 != 0 THEN 7
WHEN @MasterID = 8 AND @x % 100 != 0 THEN 7
ELSE @MasterID
END,
NEWID(),
DATEADD(DAY, - ABS(CHECKSUM(NEWID())) % 1000, GETDATE())
)
SET @x = @x + 1
end
go
-- DO THE INDEX AND STATISTIC MAINTENANCE
----------------------------
alter index all on dbo.DetailTable reorganize
alter index all on dbo.MasterTable reorganize
update statistics dbo.DetailTable WITH FULLSCAN
update statistics dbo.MasterTable WITH FULLSCAN
go
Preparation is done, let's start with the query
Let's have a look at the statistics first, look at RANGE_HI_KEY=8
, there are 489 EQ_ROWS
-- CHECK THE STATISTICS
----------------------------
dbcc show_statistics ('dbo.DetailTable', IX_DetailTable)
GO
Now we do the query. The first one is the original query I had to optimize. Please activate the current execution plan when executing. Have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"
-- ORIGINAL QUERY
----------------------------
SELECT d.DetailId
FROM dbo.DetailTable d
INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId
WHERE m.Name = 'N8'
AND d.CreateDate > '20150312 11:00:00'
GO
-- FORCESEEK
----------------------------
SELECT d.DetailId
FROM dbo.DetailTable d WITH (FORCESEEK)
INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId
WHERE m.Name = 'N8'
AND d.CreateDate > '20150312 11:00:00'
GO
-- Actual: 489, Estimated 50.000
-- TABLE VARIABLE
----------------------------
DECLARE @MasterId AS TABLE( MasterId SMALLINT )
INSERT INTO @MasterId (MasterId)
SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'
SELECT d.DetailId
FROM dbo.DetailTable d WITH (FORCESEEK)
INNER JOIN @MasterId m ON d.MasterId = m.MasterId
WHERE d.CreateDate > '20150312 11:00:00'
GO
-- Actual: 489, Estimated 40.000
-- TEMP TABLE
----------------------------
CREATE TABLE #MasterId( MasterId SMALLINT )
INSERT INTO #MasterId (MasterId)
SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'
SELECT d.DetailId
FROM dbo.DetailTable d --WITH (FORCESEEK)
INNER JOIN #MasterId m ON d.MasterId = m.MasterId
WHERE d.CreateDate > '20150312 11:00:00'
-- Actual 489, Estimated 489
DROP TABLE #MasterId
GO
Analyse and final question(s)
Please have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"
The comments in the script above show you the values I got for estimated and actual row count.
In our production environment this table has 33 million rows, the estimated rows in the queries above differ from 3 million to 16 million.
To summarize:
when a join between the DetailTable and the MasterTable is made, the estimated rowcount is 12,5% (there are 8 values in the master table, it makes sense, kind of...)
when a join between the DetailTable and the table variable is made, the estimated rowcount is 10%
when a join between the DetailTable and the temp table is made, the estimated rowcount is exactly the same as the actual row count
The question is why do these values differ?
The statistics are up to date and making an estimation should really be easy.
I just would like to understand this.