When designing tables, I've developed a habit of having one column that is unique and that I make the primary key. This is achieved in three ways depending on requirements:
- Identity integer column that auto increments.
- Unique identifier (GUID)
- A short character(x) or integer (or other relatively small numeric type) column that can serve as a row identifier
Option 3 would be used for fairly small, mostly-read lookup tables that might have a unique fixed-length string code, or a numeric value such as a year or other number.
For the most part, all other tables will have either an auto-incrementing integer or a unique identifier primary key.
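To make those three options concrete, here is a minimal T-SQL sketch; the table and column names are made up purely for illustration:

```sql
-- 1. Identity integer column that auto-increments
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100)     NOT NULL
);

-- 2. Unique identifier (GUID) primary key
CREATE TABLE dbo.DocumentFile (
    DocumentID   uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
    DocumentName nvarchar(260)    NOT NULL
);

-- 3. Short static code as the key of a small, mostly-read lookup table
CREATE TABLE dbo.Country (
    CountryCode char(2)      NOT NULL PRIMARY KEY,  -- e.g. 'US', 'DE'
    CountryName nvarchar(80) NOT NULL
);
```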
The Question :-)
I have recently started working with databases that have no consistent row identifier, and whose primary keys are clustered composites spread across various columns. Some examples:
- datetime/character
- datetime/integer
- datetime/varchar
- char/nvarchar/nvarchar
Is there a valid case for this? I would have always defined an identity or unique identifier column for these cases.
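To illustrate the contrast (hypothetical table and column names), here is what one of those composite clustered keys looks like next to the surrogate-keyed design I would normally use:

```sql
-- What I am finding: a composite clustered primary key
CREATE TABLE dbo.MeterReading (
    ReadingDate  datetime     NOT NULL,
    MeterCode    char(10)     NOT NULL,
    ReadingValue decimal(9,2) NOT NULL,
    CONSTRAINT PK_MeterReading PRIMARY KEY CLUSTERED (ReadingDate, MeterCode)
);

-- What I would have built: a surrogate key plus a unique constraint on the natural key
CREATE TABLE dbo.MeterReadingAlt (
    MeterReadingID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ReadingDate    datetime          NOT NULL,
    MeterCode      char(10)          NOT NULL,
    ReadingValue   decimal(9,2)      NOT NULL,
    CONSTRAINT UQ_MeterReadingAlt UNIQUE (ReadingDate, MeterCode)
);
```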
In addition, there are many tables with no primary key at all. What are the valid reasons, if any, for this?
I'm trying to understand why tables were designed as they were, and it appears to be a big mess to me, but maybe there were good reasons for it.
A third question to help me decipher the answers: in cases where multiple columns are used to form a composite primary key, is there a specific advantage to this method over a surrogate/artificial key? I'm thinking mostly in regard to performance, maintenance, administration, etc.
What is special about the primary key?
What is the purpose of a table in a schema? What is the purpose of a key of a table? What is special about the primary key? The discussions around primary keys seem to miss the point that the primary key is part of a table, and that table is part of a schema. What is best for the table and table relationships should drive the key that is used.
Tables (and table relationships) contain facts about the information you wish to record. These facts should be self-contained, meaningful, easily understood, and non-contradictory. From a design perspective, tables added to or removed from a schema should not affect the table in question. The purpose for storing the data should relate only to the information itself. Understanding what is stored in a table should not require a scientific research project. No fact stored for the same purpose should be stored more than once. A key is the whole or a part of the recorded information that is unique, and the primary key is the specially designated key that is to be the primary access point to the table (i.e. it should be chosen for data consistency and usage, not just insert performance).
It was said that primary keys should be as small as necessary. I would say that keys should be only as large as necessary. Randomly adding meaningless fields to a table should be avoided. It is even worse to make a key out of a randomly added meaningless field, especially when it destroys the join dependency from another table to the non-primary key. This is only reasonable when there are no good candidate keys in the table, but that situation is surely a sign of poor schema design if it holds for all tables.
It was also said that primary keys should never change, since updating a primary key should always be out of the question. But an update is the same as a delete followed by an insert. By that logic, you should never delete a record that has one key and then add another record with a second key. Adding a surrogate primary key does not remove the fact that the other key in the table exists. Updating a non-primary key can destroy the meaning of the data if other tables depend on that meaning through a surrogate key (e.g. a status table with a surrogate key whose status description is changed from 'Processed' to 'Cancelled' would definitely corrupt the data; see the sketch below). What should always be out of the question is destroying the meaning of the data.
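A minimal sketch of that status-table example (names are hypothetical): because the surrogate key carries no meaning, nothing prevents the description from being rewritten underneath every row that references it.

```sql
CREATE TABLE dbo.OrderStatus (
    OrderStatusID int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- meaningless surrogate key
    Description   varchar(20)       NOT NULL               -- the actual meaning, unprotected
);

CREATE TABLE dbo.CustomerOrder (
    OrderID       int NOT NULL PRIMARY KEY,
    OrderStatusID int NOT NULL REFERENCES dbo.OrderStatus (OrderStatusID)
);

-- Nothing stops this: every order that pointed at 'Processed' now reads 'Cancelled'
UPDATE dbo.OrderStatus
SET    Description = 'Cancelled'
WHERE  Description = 'Processed';
```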
Having said this, I am grateful for the many poorly designed databases that exist in businesses today (meaningless-surrogate-keyed-data-corrupted-1NF behemoths), because that means there is an endless amount of work for people that understand proper database design. But on the sad side, it does sometimes make me feel like Sisyphus, but I bet he had one heck of a 401k (before the crash). Stay away from blogs and websites for important database design questions. If you are designing databases, look up CJ Date. You can also reference Celko for SQL Server, but only if you hold your nose first. On the Oracle side, reference Tom Kyte.
I too always use a numeric ID column. In Oracle I use number(18,0) for no real reason over number(12,0) (or whatever maps to an int rather than a long); maybe I just don't want to ever worry about getting a few billion rows in the DB!
I also include created and modified columns (timestamp type) for basic tracking, where it seems useful.
I don't mind setting up unique constraints on other combinations of columns, but I really like my id, created, modified baseline requirements.
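As a sketch of that baseline (Oracle 12c+ syntax, illustrative names; older versions would use a sequence and trigger instead of the identity clause):

```sql
CREATE TABLE widget (
    id       NUMBER(18,0) GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name     VARCHAR2(100) NOT NULL,
    created  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
    modified TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,  -- kept current by the application or a trigger
    CONSTRAINT widget_name_uq UNIQUE (name)            -- business uniqueness still enforced
);
```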
We do a lot of joins, and composite primary keys have just become a performance hog. A simple int or long takes care of many problems; even though you are introducing a second candidate key, it's a lot easier and more understandable to join on one field than on three.
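For illustration (hypothetical tables), the difference in join noise is easy to see:

```sql
-- Joining on a single surrogate key
SELECT o.OrderNumber, l.LineTotal
FROM   OrderHeader o
JOIN   OrderLine   l ON l.OrderHeaderID = o.OrderHeaderID;

-- The same join when the parent carries a three-column composite key
SELECT o.OrderNumber, l.LineTotal
FROM   OrderHeader o
JOIN   OrderLine   l ON l.OrderDate    = o.OrderDate
                    AND l.CustomerCode = o.CustomerCode
                    AND l.OrderSeq     = o.OrderSeq;
```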