Why does the presence of primary key on the table

2020-07-18 05:38发布

问题:

I was trying to see the kind of performance gains column-store indexes can provide on a table. The table has about a 3.7 million rows, 11 columns and is stored as a heap (i.e without a primary key). I create a column-store index on the table and run the following query:

SELECT 
    [Area], [Family],
    AVG([Global Sales Value]) AS [Average GlobalSalesValue],
    COUNT([Projected Sales])
FROM 
    dbo.copy_Global_Previous5FullYearSales
WHERE 
    [Year] > 2012  
GROUP BY 
    [Area], [Family]

The create table statement is as follows:

CREATE TABLE [dbo].[copy_Global_Previous5FullYearSales]
(
    [SBU] [NVARCHAR](10) NULL,
    [Year] [INT] NULL,
    [Global Sales Value] [MONEY] NULL,
    [Area] [NVARCHAR](50) NULL,
    [Sub Area] [NVARCHAR](50) NULL,
    [Projected Sales] [MONEY] NULL,
    [Family] [NVARCHAR](50) NULL,
    [Sub Family 1] [NVARCHAR](50) NULL,
    [Sub Family 2] [NVARCHAR](50) NULL,
    [Manufacturer] [NVARCHAR](40) NULL,
    [rowguid] [UNIQUEIDENTIFIER] NOT NULL,
    [ID] [INT] IDENTITY(1,1) NOT NULL,

    PRIMARY KEY CLUSTERED ([ID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, 
              IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, 
              ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

The performance gains I get from column-store index in this case is negligible. The query with the column-store index runs nearly as slow as the original query without an index, in some cases, even slower, even though batch mode is processing is used too.

Surprisingly, when I create an ever increasing primary key - ID on the existing table and rebuild the column-store index, I get a 15X improvement on the CPU time and a 3X improvement on the elapsed time.

I don't understand how the addition of a primary key could affect query performance for column-store indexes which store the data in a compressed format anyway. Also primary keys only change the ordering of the pages which in this case, would be none.

Below is the execution plan

回答1:

The presence of a a key changes how the columnstore is built. Because the builder gets its input in order the resulted segments are better candidates for segment elimination. Read more at Ensuring Your Data is Sorted or Nearly Sorted by Date to Benefit from Date Range Elimination:

The most common type of filter in data warehouse queries is by date. Columnstore segment elimination helps you skip entire one-million-row segments if the system can determine that no rows qualify, simply by looking at the minimum and maximum values for a column in a segment. So you usually will want to make sure that your segments are sorted, or nearly sorted, by date, so date filters can be executed as fast as possible.

Your order is by ID but I'm pretty sure that causes functional dependencies side-effects.