I am struggling to understand 4NF, 5NF, and their difference.
Here is the way I would describe 4/5NF (or, how I would describe the steps to achieve it) to someone who doesn't know. I am stating this because this will show what I have really understood.
Typically, a N:N entity relationship should be implemented by having a join table for their possible combinations. If there are 3 or more entities connected with N:N relationships, one should consider carefully:
- The more general(including) solution would be to implement a join table containing all entities as fields, and all the combinations of them as values(rows)
- However, if the relationship of these entities is not realy a
per-full-tuple case, but rather the (cartesian) product of (some of)
their dyadic N:N relationships, then carefully think the minimum
amount of two-field tables needed.
- Generalizing 2, always prefer (if it is correct, of course) to have join tables with as few fields as possible. And oviously, do not create a join table if there is no use for it.
- A helpful tip to distinguish the above is, when an insert is done, if your heart(!) tells you you are doing redundunt, or invalid, things, then you should choose one of the later methods.
E1) Example of Wikipedia's page on 4NF:
https://en.wikipedia.org/wiki/Fourth_normal_form
We have entities Restaurant, Pizza Variety, and Delivery Area. We could implement their many-to-many relationships with one join table including all three. However, if one thinks of the data correctly, these triplets are a product of only 2 N:N relationships: Restaurant:Pizza and Restaurant:Delivery Area. If the "A1 Pizza" Restaurant decided to include "Thin Crust" Pizza Variety to its repertoire, then I'd have to either insert one row with the same restaurant/pizza variety to all delivery areas of "A1 Pizza" which would feel 4.redundant, or only insert for a specific delivery area, which would feel 4.invalid, because no shop would offer less variety to a delivery area (or at least, let's say our specification says so).
E2) Example of Wikipedia's page on 5NF:
https://en.wikipedia.org/wiki/Fifth_normal_form
We have entities Salesman, Brand, and Type. We could implement their many-to-many relationships with one join table including all three. However, because of the "the following rule applies" part, the triplets are actually a (cartesian) product of the 3 N:N relationships available, and as such, the correct method is to have 3 join tables for it. The "Note how this setup helps to remove redundancy." part is much like my 4.th point.
That case is made even more confusing by the fact that while the article states "Also note that the table is in 4NF", the truth is that if the table had all the rows it should so as to cover the "following rule", then it would not cover 4NF! Right?
So.. What is the
difference between E1 and E2 which makes one of them a 4NF and the
other a 5NF example?
What is the difference between E1 and E2 which makes one of them a 4NF and the other a 5NF example?
Non-4NF & non-5NF relations both exhibit update anomalies due to JDs; 4NF means no anomalies from binary JDs & 5NF means no anomalies from JDs of any arity. The Wikipedia example normalization to 4NF got rid of a binary JD--the relation was a problematic 2-way join. The normalization to 5NF got rid of a 3-way JD--the relation was a problematic 3-way join. (Since it started in 4NF, it couldn't have had any problematic binary JDs.)
A relation (value or variable) is in 5NF when for every way it can be losslessly decomposed (ie into projections that join back to it) (ie a corresponding JD (join dependency) holds) the components can be joined back in some order where the common columns of each join are a superkey of the original. (Fagin's PJ/NF paper's membership algorithm.) The definition of 4NF is the same except that only the ways that it can be losslessly decomposed into two projections matters (ie the corresponding JD is binary) (ie a corresponding MVD (multi-valued dependency) holds).
(Such a permissible JD having such a sequence of joins is said to be "implied by the CKs (candidate keys)".)
The idea is that if we can decompose into projections that join back to an original then we should, except that a join on a superkey does not cause any problems/anomalies.
When a FD (functional dependency) S -> A holds in a relation with attribute set R, the relation is losslessly decomposable on S U {A} & R - {A}. So JD {S U {A}, R - {A}} holds & MVD S ->> {A} holds.
From Which highest normal form is this table in?
Relation Meanings/Predicates
On the other hand, suppose you knew the relation's meaning to the
extent that you knew that it holds tuples that make a true statement
from a (characteristic) predicate expressible as the conjunction of
others, say
ticket Ticket was submitted by a person with first name Vname
AND there is a person with name Vname Nname
AND ticket Ticket was submitted by a person with last name Nname
Join is designed so that the predicate of its output is the AND of the
predicates of its inputs. So you would know to check for whether any
corresponding decompositions of the original satisfy the JD (ie
whether the relations from the conjuncts are projections of the
original) and so to check whether the JD is implied by the original's
CKs.
The point of normalization to higher NFs is that a JD holds when a
relation's predicate can be expressed as the conjunction of others and
their relations are projections of the original, so we can use the
simpler separate relations instead, except we might as well JOIN/AND
the relations/predicates on pairwise shared CKs because there are
still no update anomalies. (If FD {x, ...} -> a
holds then a certain
MVD holds & a certain binary JD holds and the predicate of the
relation can be expressed as ... AND a = f(x, ...)
.)
The difference is not very important because 4NF itself is not important unless you are interested in the history of database design theory.
5NF requires that every join dependency (JD) satisfied by a table must be implied by the superkeys of that table. 4NF is concerned only with the concept of multi-valued dependency (MVD) but since a MVD always implies the existence of a corresponding JD there is no need to be concerned with 4NF at all. The historical reason for the existence of 4NF is that it happened to be invented first and was then effectively superseded by 5NF - just as 3NF was superseded by EKNF/BCNF.
Here is another question asking to clarify the nature of 5NF, and it has a more hands-on example to go by : How to understand the 5th Normal Form? . It might help to put the more abstract answers given here next to the "practical case" from that other question.