Why do database guys go on about normalisation?
What is it? How does it help?
Does it apply to anything outside of databases?
Why do database guys go on about normalisation?
What is it? How does it help?
Does it apply to anything outside of databases?
Normalization is one of the basic concepts. It means that two things do not influence on each other.
In databases specifically means that two (or more) tables do not contain the same data, i.e. do not have any redundancy.
On the first sight that is really good because your chances to make some synchronization problems are close to zero, you always knows where your data is, etc. But, probably, your number of tables will grow and you will have problems to cross the data and to get some summary results.
So, at the end you will finish with database design that is not pure normalized, with some redundancy (it will be in some of the possible levels of normalization).
Before directly jumping into the topic ‘Database Normalization and its types’, we need to understand data redundancy, insertion/update/deletion anomalies, Partial dependency and transitive functional dependency.
What is data redundancy and update/modification anomaly?
Data redundancy is the unnecessary duplication of data in multiple tables within the database or even within the same table. It increases the size of the database unnecessarily and decreases the efficiency of the database by causing data inconsistency.
Example:
Here the ‘student_age’ for the student Alex is repeated unnecessarily which naturally increases the data redundancy. When the column ‘student_age’ has to be changed in the future, then update has to be performed on both rows of the student Alex as in the table above. This scenario is known as update anomaly. If the user updates only one row and forgets to update the other row will cause data inconsistency.
What is insertion anomaly?
Insertion anomaly occurs when certain values for an attribute* cannot be inserted into a table without the existence of the additional data related to that particular value.
Example:
Here the ‘student_name’ and ‘exam_registered’ are assumed to be a composite primary key (primary key that contain multiple columns). Primary key should be always unique, should not hold NULL values and it must uniquely identify each row in a table. Now assume that the high school is trying to introduce a new exam called Chemistry. In the beginning no student has been registered in this course. Since the above table will not accept NULL value in the column ‘student_name’ we need wait until at least one student has been registered to make entry for the exam Chemistry in the above table.
What is deletion anomaly?
Deletion anomaly occurs occurs when certain important values of an attribute* are lost because of the deletion of other not required values.
Example:
Here the ‘student_name’ and ‘exam_registered’ are assumed to be a composite primary key (primary key that contain multiple columns). Primary key should be always unique and should not hold NULL values and it must uniquely identify each row in a table. Now assume that the student named John has cancelled his registration for the exam named English. Since the column ‘student_name’ cannot hold NULL value we will be forced to delete the entire row which cost us the loss of the exam named English from our table. But still high school offers the possibility of taking English exam to their students.
What is partial dependency?
A table is said to be in partial dependency when a non-primary key attribute in that table is fully dependent on a part of the composite primary key attribute in that table.
Example:
Consider a table that has 3 columns named ‘student_name’ , ‘student_age’ and ‘exam_registered’ as above. Here ‘student_name’ and ‘exam_registered’ can together form a composite primary key. Normally each non primary key column in a well normalized table should always depend on the complete set of composite primary key. Here ‘student_age’ depends only on the ‘student_name’ and it does not related to ‘exam_registered’ which causes this table to be in partial dependency.
What is transitive functional dependency?
A table is said to be in transitive functional dependency when a non-primary key attribute in that table more strongly depends on another non-primary key attribute in that table.
Example:
In the above table the relation between the non primary key attribute ‘postal_code’ and another non primary key attribute ‘City’ is much more stronger than the relation between the primary key attribute ‘student_id’ and the non primary key attribute ‘postal_code’. This causes the above table to be in transitive functional dependency.
With the better understanding of the above concepts we can now dive into the normalization of tables in databases.
Table without normalization
A sample denormalized table is given below which will be normalized in the incremental steps in this article.
In the below example for the student_id=2, there are 2 entries because of different parent id’s. Here we can assume like Parent_id=1 represents the father and Parent_id=3 represents the mother of this student whose student_id=2.
Example:
First Normal Form (1NF)
Step 1:
Rule 1 is satisfied in the above step but still it does not satisfy rule 2 and rule 3.
Step 2: The tables below now satisfies the Rule 1, Rule 2 and Rule 3 of 1NF.
Second Normal Form (2NF)
Except the first table all the other tables from 1NF satisfies 2NF. In the first table the ‘age’ column depends only on column ‘student_id’. This violates the Rule 2 of 2NF. Because all the non key columns should depend the primary key columns completely. So the normalized tables as per 2NF are given below.
Third Normal Form (3NF)
Usually a relational database table is often described as ‘normalized’ if it meets 3NF. Most 3NF tables are free from insert, update and delete anomalies.
Except the last table all the other tables from the 2NF satisfies 3NF. This is because the column ‘city’ more strongly depends on the column ‘postal_code’ than the primary key ‘student_id’ which makes the column ‘city’ to be transitive functional dependent on the column ‘student_id’. So the final normalized tables as per 3NF are given as below.
*Attribute:
– Consider a table of students. Here student_name, age etc., are considered as the attributes which will be the title for the corresponding columns.
======================================================================== Simple examples - database normalization
The rules of normalisation (source: unknown)
... So help me Codd.
Normalization is a step wise formal process that allows us to decompose database tables in such a way that both data redundancy and update anomalies are minimized.
Normalization Process
Courtesy
First normal form if and only if the domain of each attribute contains only atomic values (an atomic value is a value that cannot be divided), and the value of each attribute contains only a single value from that domain(example:- domain for the gender column is: "M", "F". ).
First normal form enforces these criteria:
Second normal form = 1NF + no partial dependencies i.e. All non-key attributes are fully functional dependent on the primary key.
Third normal form = 2NF + no transitive dependencies i.e. All non-key attributes are fully functional dependent DIRECTLY only on the primary key.
Boyce–Codd normal form (or BCNF or 3.5NF) is a slightly stronger version of the third normal form (3NF).
Note:- Second, Third, and Boyce–Codd normal forms are concerned with functional dependencies. Examples
Fourth normal form = 3NF + remove Multivalued dependencies
Fifth normal form = 4NF + remove join dependencies