Optimizing queries based on clustered and non-clus

2019-04-12 13:18发布

问题:

I have been reading lately about how clustered index and non-clustered index works. My understanding in simple terms (correct me if wrong):

The data structure that backs clustered and non-clustered index is B-Tree

Clustered Index: physically sorts the data based on the index column (or key). you can have only one clustered Index per table. If no index is specified during table creation, SQL server will automatically create a clustered Index on the primary key column.

Q1: Since data is physically sorted based on index, there is no extra space needed here. is this correct? so what happens when I drop the index I created?

Non-clustered Index: In non-clustered indexes, the leaf-node of the tree contains the columns values and a pointer (row locator) to the actual row in the database. Here there is extra space needed for storing this non-clustered index table physically on disk. However, one is not limited by the number of non-clustered Indexes.

Q2: Does it mean a query on non-clustered index column will not result in the sorted data?

Q3: There is an extra look-up associated here to locate the actual row data using the pointer at the leaf node. How much performance difference would this be when compared to a clustered index?

Excercise:

consider an Employee table:

CREATE TABLE Employee
(
PersonID int PRIMARY KEY,
Name varchar(255),
age int,
salary int
); 

Now I created an employee table ( a default clustered index on employee is created).

Two frequent queries on this table happen only on age and salary columns. For sake of simplicity, lets assume that the table is NOT frequently updated

for example:

select * from employee where age > XXX;

select * from employee where salary > XXXX and salary < YYYY;

Q4 : what is the best way to construct indexes, so that queries on both these column have similar performance. If I have clustered index on age queries on age column will be faster but than on salary column will be slower.

Q5: On a related note, I have repeatedly seen that indexes (both clustered and non-clustered) should be created on column with unique constraints. why is that? what will happen on failure to do this?

Thank you very much Posts I read are here:

http://javarevisited.blogspot.com/2013/08/difference-between-clustered-index-and-nonclustered-index-sql-server-database.html

http://msdn.microsoft.com/en-us/library/ms190457.aspx

Clustered vs Non-Clustered

What do Clustered and Non clustered index actually mean?

What are the differences between a clustered and a non-clustered index?

How does database indexing work?

回答1:

For SQL Server

Q1 Extra space is only needed for the clustered index if it is not unique. SQL Server will add a 4 byte uniquifier internally to a non-unique clustered index. This is because it uses the cluster key as a rowid in non-clustered indexes.

Q2 A non-clustered index can be read in order. That may aid queries where you specify an order. It may also make merge joins attractive. It will also help with range queries (x < col and y > col).

Q3 SQL Server does an extra "bookmark lookup" when using a non-clustered index. But, this is only if it needs a column that isn't in the index. Note also, that you can include extra columns in the leaf level of indexs. If an index can be used without the additional lookup it is called a covering index.

If a bookmark lookup is required, it doesn't take a high percentage of rows until it's quicker just to scan the whole clustered index. The level depends on row size, key size etc. But 5% of rows is a typical cut off.

Q4 If the most important thing in your application was making both these queries as fast as possible, you could create covering index on both of them:

create index IX_1 on employee (age) include (name, salary);
create index IX_2 on employee (salary) include (name, age);

Note you don't have to specifically include the cluster key, as the non-clustered index has it as the row pointer.

Q5 This is more important for cluster keys than non-cluster keys due to the uniquifier. The real issue though is whether an index is selective or not for your queries. Imagine an index on a bit value. Unless the distribution of data is very skewed, such an index is unlikely to be used for anything.


More info about the uniquifier. Imagine you and a non unique clustered index on age, and a non-clustered index on salary. Say you had the following rows:

age | salary | uniqifier
20  | 1000   | 1
20  | 2000   | 2

Then the salary index would locate rows like so

1000 -> 20, 1
2000 -> 20, 2

Say you ran the query select * from employee where salary = 1000, and the optimizer chose to use the salary index. It would then find the pair (20, 1) from the index lookup, then lookup this value in the main data.



回答2:

I don't know about internals of Microsoft SQL Server, but I can answer for MySQL, which you tagged for your question. The details could vary for other implementations.

Q1. Right, no extra space is needed for the clustered index.

What happens if you drop the clustered index? MySQL's InnoDB engine always uses the primary key (or the first non-null unique key) as the clustered index. If you define a table without a primary key, or you drop the primary key of an existing table, InnoDB generates an internal artificial key for the clustered index. This internal key has no logical column to reference it.

Q2. A order of rows returned by a query that uses a non-clustered index is not guaranteed. In practice, it's the order in which the rows were accessed. If you need rows to be returned in a specific order, you should use ORDER BY in your query. If the optimizer can infer that your desired order is the same as the order in which it will access rows (index order, whether by clustered or non-clustered index), then it can skip the sorting step.

Q3. InnoDB non-clustered index does not have a pointer to the corresponding row at a leaf of the index, it has the value of the primary key. So a lookup in a non-clustered index is really two B-tree searches, the first to find the leaf of the non-clustered index, and then a second search in the clustered index.

This is double the cost of a single B-tree search (more or less), so InnoDB has an extra feature called the Adaptive Hash Index. Frequently-searched values get cached in the AHI, and the next time a query searches for a cached value, it can do an O(1) lookup. In the AHI cache, it finds a pointer directly to the leaf of the clustered index, so it eliminates both B-tree searches, part of the time.

How much this improves total performance depends on how frequently you search for the same value(s) that have been searched before. In my experience, it's typical for the ratio of hash searches vs. non-hash searches to be about 1:2.

Q4. Construct the indexes to serve the queries you need to be optimized. Typically a clustered index is a primary or unique key, and at least in the case of InnoDB, this is required. Neither age nor salary is likely to be unique.

You may like my presentation, How to Design Indexes, Really.

Q5. InnoDB automatically creates an index when you declare a unique constraint. You can't have the constraint without an index existing for it. If you didn't have an index, how would the engine ensure uniqueness when you insert a value? It would need to search the entire table for a duplicate value in that column. The index helps to make unique checks much more efficient.