In the question How to Set Up Primary Keys in a Relation, PerformanceDBA talked about Relational Integrity and pointed out that it is different from Referential Integrity.
I have heard of Referential Integrity, which is related to Foreign Keys.
But Relational Integrity seems strange to me.
In this question Are relational integrity and referential integrity the same thing?, Chris said the two are the same thing.
The terminologies and definitions in the database world are really confusing me.
Could someone explain the Relational Integrity?
The relational database theory has been established for decades since Edgar F. Codd proposed the Relational Model in his paper in 1970.
However, there aren't consistent definitions for normal forms, integrity, etc, in books or on the web.
As a learner, I am confused.
A database state represents some world situation. Because of what we are supposed to put into a particular database's base tables regarding the world, and because only some world situations can arise, only some database states can arise. It would be an error if the database ever held another state.
Database integrity is about making sure that the database never contains an invalid state. (Eg you would see a textbook chapter titled "Integrity" or "Database Integrity" addressing this topic.)
"Relational integrity" isn't a common term. I would expect it to refer to database integrity of relational databases. Constraints are collectively the description of valid relational database states. A relational DBMS enables their declaration and enforces them.
One could also reasonably use it to mean adherence to the principles of the relational model.
Referential integrity, a subtopic of relational database integrity, is about making sure that the database never contains an invalid state of the kind described by (depending on context) either a foreign key constraint or an SQL FOREIGN KEY constraint, what we might call a foreign superkey constriant. (SQL's FOREIGN KEY definitions can involve references to superkeys rather than just candidate keys as in a foreign key constraint.)
What really matters is the underlying notions of the relational model. Common and ad-hoc terms just let us refer to them.
The Confusion
I will answer your questions in a sequence that makes overall sense, and reduces the length.
I would say, that is the single biggest problem in the database industry. There are two camps.
The Codd Camp
In other words, any qualified person can understand them, and use them (one does not need to be able to read and understand mathematical definitions.)
The high-end RDBMS vendors are squarely in this camp. Along with their customers, over the decades, they lead the SQL committee in implementing features and facilities of the Relational Model, and of SQL, its data sublanguage (note that SQL is not a language, but a sublanguage for accessing the database).
One result has been, the natural progression of the Relational Model. Features that the detractors have suggested are "incomplete", have been "completed", and implemented. Any one who faithfully implements the Relational Model would see that as natural progressions, not as "extensions".
All the small but important advances and progressions have been implemented by genuine theoreticians and scientists who work for the high-end vendors. (That is, neither I, nor the customers, nor the vendors, all of whom proposed such advances, which the vendors implemented, suggest that the Relational Model was "incomplete", or that we "completed" it. We happily accept that such is not "outside" the Relational Model.)
The non-SQLs, and pretend-SQLs do not figure in this category.
"Theoretician"
The days of genuine theoreticians and scientists are long gone. What we have now is people who exist in total isolation from the industry that they allege to be serving; in denial of other science (eg. the laws of physics, and of logic, and of common sense); and who write academic papers that they cite, and therefore elevate. Academic papers and mathematical proofs are based only on the fact that, in such denial and isolation, they prove what they propose to prove. The fact that the proof is pure garbage in the real world, that it can only be "true" if related science is denied, is not relevant to them.
Such denial of reality is schizophrenic. Unfortunately, they teach it at university level these days.
Thus you see papers elevated purely because they have many citations (from fellow schizophrenics), and not on the basis of whether it is science or not.
For a person who has not been "educated" to be schizophrenic, it is easy to destroy such papers. In another answer to you, I have given proof of this:
my Response to Hidders (in which he suggests a "problem" in a "relational database" that he proposes to "solve", and which "problem" disappears in the simple act of placing the data in a Relational context.
my Response to Köhler (in which he suggests a "problem" in a "relational database" that he proposes to "solve" by creating yet another novel "normal form", and which "problem" disappears in the simple act of placing the data in a Relational context.
The bottom line, separate to the evidence that the papers produced by "theoreticians" are founded on the Straw Man method, and thus false, is the evidence that they are clueless about the Relational Model that they allege to be writing about.
It is fraud, on a massive scale, because it is established in the new "education" system.
The Post-Codd Gulag
This camp was made up initially of "theoreticians". In the forty five years since the Relational Model was published, they have not produced a single thing that has advanced or progressed it.
Now it is made up of all their followers and proponents. And because they have written books, which are now used as textbooks in university courses, it includes all the "professors" who teach this garbage, like parrots, without verifying theory for themselves.
They have, however, produced a morass of fragments, that they claim to be "relational". They break Codd's definitions into tiny pieces, and deal with each in isolation, either to attack the Codd definitions, or to support their own fragments.
Eg. where Codd has clear and straight-forward definitions for 1NF, 2NF, 3NF (that last includes the definition for Functional Dependency), and any competent person could apply them, these creatures have:
six fragments of the NFs ("1NF", "2NF", "3NF", "BCNF" (aka "3.5NF"), "4NF", "5NF")
which taken together, do not cover even a fraction of what Codd's three NFs cover.
which is why I have stated in other answers, that if one understands and applies Codd's three NFs, in spirit and word, it covers the above six NF fragments, as well as any NF fragments that has not been written yet (by the "theoreticians").
DKNF is the ultimate NF, and clearly the goal of the Relational Model (to anyone who is genuinely trying to understand it), but not defined as such. This is one of the natural progressions (above), and obvious to a faithful implementer. While I wouldn't say it is easy, it is entirely feasible: all my databases since 2007 are DKNF.
Of course, their fragments are incomplete, they are forever finding "holes"in their definitions, and "filling" them. There is no end to the "holes", and no end to the "fillers". Codd's definitions have no holes, and do not need fillers.
And of course, there is no integration of their numerous fragments. It is the very definition of dis-integration.
They produced a number of "relational algebras" (I can't keep count) in competition with Codd's Relational Algebra. Of course, none of them comes even close, and Codd's is strongly established as the Relational Algebra.
They claim that the Relational Model is somehow "incomplete", and that their inventions "complete" it. They don't. Their inventions stand as evidence that they do not understand the Relational Model that they propose to "complete". It is part of the fraud, to elevate their otherwise incredible inventions.
In doing all of this, of course they have introduced confusion. That is their goal, as detractors of the Relational Model. Their 42 or so "relational models" and "relational algebras" can only exist in a state of confusion, where practitioners and seekers are confused about what the Relational Model is.
This answer is long, only because, in order to answer your question, I have to first remove the confusion, what the Relational Model is not, and second, confirm what the Relational Model is. If these creatures had not introduced all this fraud, this confusion, if the Relational Model was clear, and the definitions were not sabotaged, the answer to the question is simple and straight-forward:
But for the confusion, that is not enough, I have to provide more detail, and proof, to destroy the confusion.
And notice, they have established the notion that everything is a matter of opinion; of argument; it is subjective. Of course, the truth is objective, not open to opinion; discussion; or argument. It is easy to prove: just read the Relational Model and see for yourself if something is defined, and what that definition is.
The Difference
The main difference between the Codd Camp and the "Theoreticians" Gulag, is this:
The post-Codd authors do not understand the Relational Model. Over the four decades, separate to the fact that they have not added one iota to the Relational Model, they have established a private "relational model" (actually several), with their private definitions and private terms. The evidence for this is four-fold:
First, they understand only a tiny fraction of the Relational Model. They are unaware of (eg) Relational Keys; Relational Integrity.
Second, they propagate various inventions that are specifically prohibited in the Relational Model: (eg) surrogates; Access Path Dependence; circular references. Again, that is only possible if they are ignorant of the Relational Model
Third, that they have private definitions for terms, that do not match the definitions in the Relational Model. That alone guarantees that they are divorced from, and do not serve, the industry that they allege to serve. Further, they cannot converse with any Relational Model practitioner.
Fourth, that they have terms (and private definitions) for fabrications that are not in the Relational Model. But they fraudulently declare such inventions to be part of the "relational model".
In sum, it means that what they practice and preach as "relational" or "relational model", is far from Relational, and in fact, by virtue of the evidence, it is Anti-relational.
Since they do not understand the Relational Model, what is it that they do understand, that they propagate ?
Well, from the mountain of evidence (ie. their own writings: books, textbooks, academic papers, etc), they only understand and propagate the technology we had before the advent of the Relational Model and RDBMS: pre-1970's Record Filing Systems.
Which have none of the Integrity (we will get to that in detail), or power, or speed, of Relational databases (ie. one that complies with the Relational Model).
If one were asked to describe the difference between pre-relational DBMS and Relational DBMS, in one sentence, it would be that pre-relational systems related records by Record ID, with no control of the content of records, whereas Relational system related rows by Relational Key, with full control (integrity) of the row content.
The Visible Difference
The difference identified above is theoretical, understandable to people who have a grounding in theory (and denied by the "theoreticians"). However the difference is quite visible even to novices, in two items (there are many, I am exposing the two really obvious ones), and here I can provide specific evidence:
The Relational Model demands a Primary Key, which is then used as a Foreign Key, to establish relationships. The detractors implement Record IDs as "primary key", which:
fail the definition of Key in the Relational Model. (The Relational Model definition of Key is it must be made up from the data. Record IDs, GUIDs, etc, are manufactured, they are not data.)
implements Access Path Dependence, which is specifically prohibited in the Relational Model. Access Path Dependence was the characteristic limitation of pre-relational Record Filing Systems.
(This leads to having to navigate every file between tow distal files, whereas in the Relational Model, two distal tables can be JOINed directly.)
(Which by the way, is proof that the Record Filing Systems in fact, require more, not less, JOINs, contrary to the mythology.)
Thus they elevate surrogates, Record IDs, to the status of "key".
But it is pure fraud. There is no such thing as a "surrogate key" in the Relational world, because the two words contradict each other, either it is a surrogate (a non-key) xor it is a Key (a non-surrogate).
Further, by using the term "surrogate key", one naturally expects at least some, if not all, the qualities of a Key, and a surrogate has none of the qualities of a Key.
That is "normal" in the world of the "theoreticians", which is divorced from the Relational world, because they do not have Keys, they have surrogates as "keys"
They have their invention, along with their private definition, the "candidate key".
It is an invention to hide the fact that they are not using Primary Key as defined in the Relational Model, which in and of itself, is a breach of the Relational Model.
Again, it is pure fraud, because there is no such thing defined in the Relational Model. So, whatever it is, it is outside the Relational Model
What is it ? It is a fragment of a Key. a surrogate alone does not provide row uniqueness, which is demanded in the Relational Model. And they need row uniqueness to establish that small fraction of integrity in their Record Filing System.
Now, note, at this point, to be complete, therefore a surrogate is always an additional column, ie. additional to the data columns. It is never an either/or proposition, as many novices (such as Fowler and Ambler, such as those who propose it to be an "opinion", that hasn't reached "consensus") propose it to be.
It is a fragment of a Relational Key because the "candidate key" is not implemented as a Key in the Record Filing System, across files. It is implemented, only in the single File that it is defined in, and therefore not used Relationally. Whereas in the Relational Model, such a Key would be the Primary Key, and it would be a Foreign Key in every child of that table, ie. it would be used Relationally, across tables.
Thus is remains a fragment of a Relational Key, existing only in the file it is resident in.
Of course "candidate keys" do not work. So they have come up with yet another invention, to make it work. The "superkey". It doesn't work either, and requires massive duplication. It doesn't provide any of the qualities of a Relational Key. It only provides a fragment, a step, above the failed "candidate key", and thus the whole thing remains a failure.
Take Functional Dependence. Because of the introduced confusion and sabotage, we have to call it Full Functional Dependence. Codd' 3NF Definition gives the definition of Functional Dependence as well. This is in plain technical English, it is easily understood, and implemented.
The most important thing to understand is, since we are talking about the Relational Model, when Codd uses the term Key, he means a Relational Key. therefore the Key has to be determined and available first, before Functional Dependence of attributes on the Key can be tested (it is a test), second.
But the saboteurs, the subverters, have invented fragments of the 3NF definition. I won't go into in detail (unless asked), however, if you examine them, you will see that they serve one purpose: to elevate their non-relational "candidate key" fraudulently to the status of a Key.
Further, their entire set of definitions of their fragments (seven or so) that relate to Codd's 3NF definition, is complex, ordinary implementers cannot understand it, let alone implement it.
Summary. If anyone uses the terms { "candidate key", "superkey", "partial/transitive dependence" [rather than Full Functional Dependence] }, they are identifying themselves, categorically, as being (a) ignorant of the Relational Model, and (b) Anti-relarional.
The Result
The result, and this is the real accomplishment of the "theoreticians", is that 95% of the implementers out there implement Record Filing Systems, that have none of the Inteegrity, power, or speed of Relational databases, but they think that their RFS is "relational". It is a great pity that the "theoreticians" cannot acknowledge, and enjoy their one and only accomplishment.
Declarations of "Theoreticians"
This is what we have to understand, in order to penetrate the confusion and to identify falsities as such. If you understand that the "theoreticians" are heavily invested in their 42 or so "relational models" and "relational algebras", that they are quite different to the Relational Model, you will understand that actually, they are quite clueless about the Relational Model.
But this does not stop them from making declarations about the Relational Model, what it does, what it can't do, etc. Therefore do not believe any pronouncement that they make, it is like a pygmy making an pronouncement about airplane flight (refer the Gods Must Be Crazy).
The Question
True. But only for the high end of the industry. Guys like me. Guys who have a sound grounding in theory and science, and who reject the non-science of the post-Codd era. The majority, 95%, and schooled in the anti-relational system.
Yes.
Yes. But only a genuine Relational practitioner can do that.
Be warned, due to the state of the Relational database industry, the introduced confusion, the massive fraud being perpetrated, as detailed above, the saboteurs and subverters will say, either there is no such thing, or there is, but it is impossible to implement, or that it is the same as Referential Integrity.
No, they are not.
As defined above, that statement proves he is clueless about the Relational Model, and that he is Anti-relational. As such, he cannot know what the Relational Model is, what Relational Integrity is. They do not know what they are missing, so they cannot describe or define it.
I can.
Let's start with Referential integrity, so that we know (a) what that is, and (b) how that is different to Relational Integrity.
We need a decent example to work with. Note that frauds and thieves use simple examples, because anything can be proved (or disproved) using simple, trite examples. Deep understanding requires full examples, that are "complex" enough to demonstrate the issue.
Let's use an example that I have given the "theoreticians" on comp.databases.theory to solve. They couldn't solve it. I gave pointers and hints. They still couldn't solve it.
That stands, in and of itself, as evidence that the "theoreticians" cannot Normalise Anything. Despite the fact that they have 17 mathematical definitions for their abnormal, fragmented, "normal forms".
We really should be shouting that from the rooftops. They are no position to be telling practitioners anything.
Here is a typical implementation by a developer who has been reading the books of the detractors, and following them carefully. As is typical, he thinks this is "relational". But it isn't Relational at all, it is a Record Filing System, with none of the Integrity, power, or speed of a Relational Database.
The content should be familiar to everyone, therefore I will skip the description. Note that there are ISO and ANSI/FIPS Standard Codes for the first three levels, eg. ISO-3166-1, 3166-2, and FIPS.
Typical Record Filing System Implementation declared as "relational"
But his "primary keys" are not Primary Keys, they are Record IDs.
The developer has declared this set of files as "satisfies 5NF". The "theoreticians" have passed this as such.
As far as Codd and I are concerned, (a) it fails 3NF and (b) it fails Relational. But we won't be dealing with that here, we just need a good example to use for our purpose, Relational Integrity.
Let's look at the DDL for the Country File.
So far so good. Let's look at the State File.
Notice that (eg) both Canada and Australia have a StateCode "NT", and the Alternate Keys allow that. But also note that when inserting States, we are forced to use a Record ID, instead of data, to identify the parent Country of the State that is being inserted. That should ring alarm bells.
So far so ordinary. Let's look at the County File.
When inserting Counties, we are forced to use a Record ID, instead of data, to identify the parent State of the County that is being inserted. That should ring more alarm bells. Note that (eg) America has a County named "Lee" in 12 States, but not in New York, and Canada has none.
Whoops, we just inserted a Lee County into New York State. This exposes two important issues:
From both the user perspective, and the app that the user uses to maintain the data in the filing system, there is very little data integrity, because there is very little data Identification. (Record IDs are not data, they should not even be seen by the user.)
So even if the app supplied a Record ID for State, before it determined that, it had to ask the user "which State" the County was being inserted into, and provide a drop-dow, which (hopefully) contained StateCodes (minimum) or State Names (better) in alphabetical order.
And then extract the Record ID for the chosen State, abandoning the StateCode.
From purely the data perspective, it is entirely possible Identify a County, and Identify it uniquely (CountryCode-or-Name, StateCode-or-Name, CountyCode-or-Name).
But in the "theoreticians" Filing System, we are prevented.
So what is the problem ?
The problem is, the "theoreticians" do not understand Relational Keys, or Identifiers (IDEF1X term). Because they do not understand it, they don't know what they are missing.
The solution is, the Relational Model, with Relational Keys.
Referential Integrity
There is something that needs to be understood first. What is a Primary Key ? In the SQL context, ie. an implementation, the use of the keyword
PRIMARY KEY
does not magically transform the column named as such, into a Primary Key (as per the Relational Model). It merely enforces uniqueness on the named column.What is Referential Integrity ?
It is merely the use of the
FOREIGN KEY ... REFERENCES
keywords in SQL. It does not magically provide Relational Integrity. It provides only an enforcement in the server (or non-server, for the NON-sqls), that the FK identified in the referencing File, is a PK in the referenced File.Let's stop here for a moment and take stock. The "theoreticians" and post-Codd authors know only RFS, and very little SQL. They know only Referential Integrity (and mostly, they do not implement even that, but let's not get distracted). Thus they are not in a position to make declarations, one way or the other, regarding what they do not know.
Relational Integrity is much more than Referential Integrity.
Relational Sets
Dr E F Codd asks us to think of the data in terms of sets. The "theoreticians" don't understand that. They know of only two sets: the set that contains the entire file; and the empty set.
In the Relational world, there is much more to sets than that.
At the first level, Country, sure, there is just one set, of Country.
At the second level, State, the simple set is all States of all Countries. But there is a more Relational set that we can work with:
At the third level, County, the simple set is all Counties of all Countries. But there are more Relational set that we can work with:
each set of Counties that belong to one State, in one Country, the Country-State-Counties.
and of course, each set of Counties that belong to Country, the Country-Counties.
So, while the Relational Model provides that, that is not implemented in the RFS, and the integrity that is possible in such RFS, is limited to referencing one file at a time, not to referencing sets of data.
Relational Integrity
Now let's look at the integrity that the Relational Model provides.
In order to afford a comparison, we need that example transformed into a Relational database. Look at just the first three tables on page 2.
Let's look at the same inserts:
Here we are saying, the Identifier or Relational Key for State is ( CountryCode, StateCode ). No alarm bells re a StateCode of "NT", because the CountryCode makes it unique.
Here we are saying, the Identifier or Relational Key for County is ( CountryCode, StateCode, CountyCode ). We are unlikely to make the mistake of placing a Lee County in New York State, because when we enter it, we know it is wrong, we have to actively enter the wrong thing "NY". Whereas when it was a number, an ID field, we didn't have a clue:
To sum that up in Relational terms:
In the Record Filing Systems marketed and propagated by the post-Codd authors and "theoreticians", which has no Relational Keys; no Sets, they have only the Referential Integrity provided by SQL.
In a Relational database, which has Relational Keys and Sets, in addition to the Referential Integrity provided by SQL, we have Relational Integrity.
Answer Exceeds SO Limit
I have a second example, and further explanation, to complete this. But the answer is too long. I have to transform it into a WP file, place it in a link, etc. Tomorrow.
No doubt he intended the term ("relational integrity") to mean "100% faithfulness to the original model as devised by Codd". It is a rare property. Don't go looking for it in SQL systems. Just remember that it is not a term of art - let alone an agreed-upon one.