How are clustered indexes stored on a hard disk? What is the logical order?
How do non-clustered indexes work?
How are clustered indexes stored on a hard disk? What is the logical order?
How do non-clustered indexes work?
This means that the data in the table are stored in a B-Tree
according to the order of the CLUSTERED PRIMARY KEY
(or the clustering columns).
This name is in my opinion a little bit confusing. The same concept in Oracle
is called index-organized table
which I find much more descriptive.
Non-clustered indexes contain the value of the indexed columns along with the pointer to the record they are originated from.
The "clustered index" is the table itself; the "non-clustered" index is an ordered copy of some of the table's columns.
If you "create" a clustered index, the table is rearranged. That's why you cannot have more than one "clustered index" on a table: the table cannot be arranged in more than one order.
If you create a secondary index, the shadow copy of the table is created, holding the values of the indexed columns and the pointers to the records they are from. Whenever the table changes, the copy is changed too (the engine takes care of that automatically).
id col1 value
-- -- --
1 1 Data 1
6 1 Data 6
3 1 Data 3
7 2 Data 7
9 2 Data 9
5 2 Data 5
The table is not ordered.
id col1 value
-- -- --
1 1 Data 1
3 1 Data 3
5 2 Data 5
6 1 Data 6
7 2 Data 7
9 2 Data 9
The table is ordered on id
.
Table Index
id col1 value col1 id
-- -- -- -- --
1 1 Data 1 1 1
3 1 Data 3 1 3
5 2 Data 5 1 6
6 1 Data 6 2 5
7 2 Data 7 2 7
9 2 Data 9 2 9
The table is orderer on id
, the index is ordered on (col1, id)
For non-clustered indexes, a separate file is created, which hold just the index fields, which has it's records placed in the logical index order. For clustered index, there is no separate file -- the data from the table itself (all the fields) is placed in the logical order of the index.
This makes lookup on the index faster (though it's really best of indexes like dates where you'll be looking up a range). It also makes inserts rather slow if the record will be inserted in the middle.
It means that a clustered index determined the physical order in which records in a table are actually stored. Non-clustered indices are simply lists of key values stored separately that enable fast lookups in other orderings than the clustered/physical ordering.
Quick example: a table with ID
(primary key), FirstName
, LastName
and Car
containing three people: 0=The Stig (Llana), 1=Jeremy Clarkson (DB9), 2=Richard Hammond (911), 3=James May (Lambo) and a clustered index on LastName
and a non-clustered index on Car
would store the actual lines of data in the table in this physical order on disk:
ID FirstName LastName Car
1 Jeremy Clarkson DB9
2 Richard Hammond 911
3 James May Lambo
0 The Stig Llana
The nonclustered index would also store something like:
Car ID
911 2
DB9 1
Lambo 3
Llana 0
Clustered Index Storage
Clustered indexes fundamentally work the exact same way that all other indexes work -- they're stored inside a variant of a struture called a B-Tree. They're stored in the same files, with the same formats as all of your other tables in SQL Server.
The Concept
Step back and think of the data that you're indexing. (I want you to think of a book in this analogy). What if, in addition to having indexes at the end of the book, you also ordered the data inside the book as well? You could look up information much faster. Take, for example, a phone book where all of the data is ordered by last name and first name. You don't have to go to the back of the phone book to find someone's number. Contrast that with a history book, where you have to go to the index at the back of the book to find what you want.
So logically, a clustered index (or "index-organized table" in Oracle) is your data, but sorted. Physically, the leaf nodes of the B-tree contain all of your table's data, in sorted order. This is really helpful when you are scanning the data in your table on a contiguous range, such as a date range.
Another important thing about clustered indexes (in SQL Server at least) is that your clustering columns (that is, the columns that make up how you sort your clustered index) are included at the end of each nonclustered index you define on your table. This makes searching for your clustering columns very fast, and this is often very desirable in OLAP databases.
Nonclustered Indexes
Your table can only be stored in one physical order. But certain times you need to look up data in other ways. For these scenarios, you use a nonclustered index. This is implemented as a B-Tree as well, but it doesn't have any bearing on the order of your table's data, like a clustered index does. That means that if you want data from your table which is not included in your non-clustered index, SQL Server will have to physically look up the data in your table in order to get what you want. This is another operation, and for many queries can be costly, and is a key design consideration when you optimize your tables.
A word
You could write a book on this stuff. Many have. If I haven't bored you to death already, check out Wikipedia's B-Tree page. Start there. If you're still (really) interested, I suggest actually programming a simple B-Tree so you can see what's involved. And, if you want to know even deeper details on how exactly SQL Server stores all of this, check out Kalen Delaney's Inside SQL Server: The Storage Engine. Is all of this learning overkill? That's for you to decide. But the more you study this, the more comfortable you will be with DB development, and the faster your systems will become. I promise.
it means that the table is ordered as specified for the clustered index. Non clustered index are physically stored separately.
Primary indexes are not technically "clustered" indexes, though both cause a physical sort ordering to the data. The difference is apparent in their very names. A primary index deals with primary keys. Meaning, each primary key must be unique (otherwise it would not be a primary key). A clustering index deals with anything that is not a primary key and, by definition, may be allowed to be non-unique. This is where the word "cluster" comes from. If you sort data that is not primary, it means it can repeat. When repeated data appears together, it is considered a "cluster."