I am learning the Relational Model and data modeling.
And I have some confusion in my mind regarding sub types.
I know that data modeling is an iterative process and there are many different ways to model things.
But I don't know how to choose between different options.
Example
Suppose we want to model the particles (molecule, atom, proton, neutron, electron, ...).
Let's ignore Quark and other particles for simplicity.
Since all particles of the same type behave the same, we are not going to model the individual particles.
Put it in another way, we are not going to store every Hydrogen atom.
Instead, we will store Hydrogen, Oxygen and other atom types.
What we are going to model is actually particle types and relationships between them.
I am using the word "type" carelessly.
A Hydrogen atom is an instance. Hydrogen is a type. Hydrogen is also a type of Atom.
Yes, there is a hierarchy of types involved. And we are ignoring the lowest level (individual particles).
Approaches
I can think of several approaches to model them.
1. One table (relation, entity) for each type of things (particle types).
1.1 The first approach that comes into my mind.
Proton (Proton)
Neutron (Neutron)
Electron (Electron)
Atom (Atom)
Atom_Proton (Atom, Proton, Quantity)
Atom_Neutron (Atom, Neutron, Quantity)
Atom_Electron (Atom, Electron, Quantity)
Molecule (Molecule)
Molecule_Atom (Molecule, Atom, Quantity)
1.2 Since there is only one kind of proton/neutron/electron, we can simplify it.
Atom (Atom, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Molecule)
Molecule_Atom (Molecule, Atom, Quantity)
In this simplified model, facts about Proton are lost.
2. All things in one table, with associative tables representing relationships between them.
2.1 one associative table for each relationship
Particle (Particle)
Atom_Proton(Particle, Particle, ProtonQuantity)
Atom_Neutron(Particle, Particle, NeutronQuantity)
Atom_Electron(Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)
2.2 single associative table
Particle (Particle)
ParticleComposition (Particle, Particle, Quantity)
This simplification doesn't lose anything. I think it's better.
But if there're facts that are specific to Atom_Proton/Atom_Neutron/Atom_Electron, 2.1 may be better.
2.3 combine 2.1 and 2.2
Particle (Particle)
Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes)
Molecule_Atom (Particle, Particle, other attributes)
ParticleComposition(Particle, Particle, Quantity, other attributes)
In this approach, common attributes about particle composition go in ParticleComposition,
while special attributes about particle composition go in special tables.
3. Use sub type tables.
3.1 A table for base type Particle, and additional tables for sub types (Atom, Molecule, ...).
Particle (Particle)
Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)
Atom_Proton (Particle, Particle, ProtonQuantity)
Atom_Neutron (Particle, Particle, NeutronQuantity)
Atom_Electron (Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)
3.2 We can also combine the Atom_XXXQuantity tables in Atom and remove Pronton/Neutron/Electron.
Particle (Particle)
Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)
Molecule_Atom (Particle, Particle, AtomQuantity)
It's simpler, but information about Proton/Neutron/Electron is lost as in 1.2.
3.3 We can change the name of Molecule_Atom to make it more generic.
Particle (Particle)
Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)
ParticleComposition (Particle, Particle, Quantity)
This looks like 2.2, with additional tables for sub types (Atom, Molecule).
It seems 2.2 is a special case of 3.3.
3.4 We can combine all the above approaches and get a generic model.
Particle (Particle)
Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)
ParticleComposition (Particle, Particle, Quantity, other attributes)
Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes)
Molecule_Atom (Particle, Particle, other attributes)
It seems that Atom_Proton, Atom_Neutron, Atom_Electron and Molecule_Atom can be thought of as sub types of ParticleComposition.
This approach is the most complex one, it contains many tables but each table has its role.
Questions
- Does any of the above designs break the rules of Relational Model?
- Which approach is the best? Does it depend on how we think about the data? Does it depend on the requirements?
If it depends on the requirements, shall we choose the simplest design at first and then make it more generic to accommodate new requirements?
Although the resulting data models share a lot of similarities, the initial design may influence the naming of the tables/columns, and the domains of the keys are different.- If we choose to use one table for each type of things, we could choose incompatible keys for Atom and Molecule, such as atom weight for Atom and molecule name for Molecule.
- If we choose to use the generic approach, we may choose a common key for all particles.
Changing Keys may have greater impact on the system, so it may not be easy to evolve from a simple design to a generic one.
What do you think?
PS: This may not be an appropriate example and the solutions may be problematic, and there may be more variations of the approaches, but you can get the point hopefully.
If you have better designs, please share with me.
Update 1
What is the data to model?
Initially, I was trying to model the particles because
- I think there are sub-typing relationships between them, which is exactly what I am looking for.
- They are well-understood (?) by people.
- It is a good example of how people understand the world.
Here is the picture in my mind.
I didn't state this clearly because I was not very clear about what I was trying to model either.
Firstly I thought Atom is the parent of Proton/Neutron/Electron, and Molecule is the parent of Atom.
Then I realized that this is about composition, not about subtyping, and not about Type Hierarchy.
Types
I have been thinking about types for a while, as well as grouping and classification.
Here is a quote from "SQL and Relational Theory":
So what is a type, exactly? In essence, it’s a named, finite set of values ─ all possible values of some specific kind: for example, all possible integers, or all possible character strings, or all possible supplier numbers, or all possible XML documents, or all possible relations with a certain heading (and so on).
People coined the name "Integer" to represent the set of integer values.
Actually, people coined concepts and names to identify things, grouping things so that we can understand/model the world.
Proton is a set of real protons, Hydrogen is a set of hydrogen atoms, and so on.
In this sense, the real particles stay at the lowest level of the type hierarchy.
I was trying to model all the particles at first, but then I was stuck because
- I couldn't think of an appropriate key to identify each real particle;
- there are too many of them to store in a database.
So I decided to ignore the real particles and model the types instead.
When we say "a molecule is composed of atoms", it means "a real H2O molecule is composed of two real Hydrogen atoms and one Oxygen atom", it also means "any (type of) molecule is composed of (some types of) atoms".
Instead of stating every fact about the real particles, we can just state facts about the particle types.
That is the benefit we get by grouping things and coined names (types).
Particle Type Hierarchy As Sets
The hierarchy can be translated into set definitions.
Second level - types above the real particles:
S_proton = { p | p satisfied the definition of a proton }
S_neutron = { n | n satisfied the definition of a neutron }
S_electron = { e | e satisfied the definition of an electron }
S_hydrogen = { h | h satisfied the definition of a hydrogen }
S_oxygen = { o | o satisfied the definition of an oxygen }
S_h2o = { w | w satisfied the definition of a h2o }
S_o2 = { o | o satisfied the definition of a o2 }
Higher Levels
Using the terminology of Set Theory, type A is a subtype of B if A is a subset of B.
I first thought we could define the Atom type as:
S_atom = S_hydrogen union S_oxygen union ...
However, the sets are relations and the elements are tuples, so the union doesn't work if tuples in the relations are incompatible.
The approaches that use subtype tables solve the problem and models the subset relationship.
But in the subtyping approach, Atom is still in the second level.
Higher level types are defined as sets of sets.
S_atom = { S_hydrogen, S_oxygen, ... }
S_molecule = { S_h2o, S_o2, ... }
S_particle = { S_proton, S_neutron, S_electron, S_atom, S_molecule }
which means Particle is the type of Atom, and Atom is the type of Hydrogen.
This way, the relationships between particles can be represented at a high level.
The new data model
4. Treat types as a hierarchy of types
ParticleType (ParticleType, Name)
ParticleTypeHierarchy (ParticleType, ParentType)
ParticleComposition (PartileType, SubParticleType, Quantity)
Sample data:
ParticleType | ParticleType | Name | |--------------+----------| | Particle | Particle | | Proton | Proton | | Neutron | Neutron | | Electron | Electron | | Atom | Atom | | Molecule | Molecule | | H | Hydrogen | | O | Oxygen | | H2O | Water | | O2 | Oxygen | ParticleTypeHierarchy | ParticleType | ParentType | |--------------+------------| | Proton | Particle | | Neutron | Particle | | Electron | Particle | | Atom | Particle | | Molecule | Particle | | Hydrogen | Atom | | Oxygen | Atom | | H2O | Molecule | | O2 | Molecule | ParticleComposition | PartileType | SubParticleType | Quantity | |-------------+-----------------+----------| | H | Proton | 1 | | H | Electron | 1 | | He | Proton | 2 | | He | Neutron | 2 | | He | Electron | 2 | | H2O | H | 2 | | H2O | H | 2 | | H2O | O | 1 | | CO2 | C | 1 | | CO2 | O | 2 |
For comparison, this is the sample data for a subtype table approach.
Particle | ParticleId | ParticleName | |------------+----------------| | H | Hydrogen | | He | Helium | | Li | Lithium | | Be | Beryllium | | H2O | Water | | O2 | Oxygen | | CO2 | Carbon Dioxide | Molecule | MoleculeId | some_attribute | |------------+----------------| | H2O | ... | | O2 | ... | | CO2 | ... | Atom | AtomId | ProtonQuantity | NeutronQuantity | ElectronQuantity | |--------+----------------+-----------------+------------------| | H | 1 | 0 | 1 | | He | 2 | 2 | 2 | | Li | 3 | 4 | 3 | | Be | 4 | 5 | 4 | ParticleComposition | ParticleId | ComponentId | Quantity | |------------+-------------+----------| | H2O | H | 2 | | H2O | O | 1 | | CO2 | C | 1 | | CO2 | O | 2 | | O2 | O | 2 |
sub-atom
These particle types are defined by people and people keep defining new concepts to model new aspects of the reality.
We can define "sub-atom" and the hierarchy will look like:
Approach 4 can accommodate this type hierarchy change more easily.
Update 2
The facts to record
- There are different types of particles in the world: protons, neutrons, electrons, atoms, molecules.
- Atoms are composed of protons, neutrons, and electrons.
- Molecules are composed of atoms.
- There are many different types of atoms: Hydrogen, Oxygen, ....
- There are many different types of molecules: H2O, O2, ....
- A Hydrogen atom is composed of one proton and one electron; ...
- A H2O molecule is composed of two Hydrogen atoms and one Oxygen atom; ...
- Different types of particles may have special properties, e.g. An atom has atom weight, etc.
- ...
When cut down to its bare essence, the relational model of data has no more than exactly one "rule" : all information in the database must be represented as values of attributes in tuples in relations.
All of your "alternatives" potentially comply to that rule, provided :
- that each attribute has an associated data type,
- and that each tuple in each relation in the database will always have a value for each of its attributes,
- and that value is a value that is a member of the data type associated with that attribute.
EDIT : you have also failed to provide any detail of what the precise nature is of the facts you want to make a record of in your system.
EDIT 2 : first comment by Walter M. still applies. Your facts seem to state things at different levels (which in the typical use cases will be notably distinct) :
"6. A Hydrogen atom is composed of one proton and one electron"
After a small rewrite to eliminate the 'AND' therein :
"6. A <atom_id> atom contains <qty> <subatomicparticletype>"
This one looks like something that would go into your database (if your use case is as typical/mundane as could be supposed) :
A 'H' atom contains 1 proton
A 'H' atom contains 1 electron
A 'H' atom contains 1 neutron
(Note how eliminating the 'AND' involved splitting the conjunction into "atomic" parts (pun intended).)
From this one, we can start wondering what to do about the <subatomicparticletype>. If your use case is such that the existence of proton/neutron/electron is just a given, and it will never change, then you can simply use a data type for it, and modeling it will not involve more than identifying the type so that your model's readers will know the intended value set. If, however your use case is such that, say, you are experimenting to try and find a completely new model of chemistry, in which there could also be foobarons alongside protons [or their existence could be removed again for the sake of experimentation], then you'd have to include a table that says "<subatomicparticletype> is part of my model of atoms".
Furthermore, you'd then also have to include a rule in your model that any <subatomicparticletype> that is claimed to be part of an atom, must indeed be one that is part of your model of atoms. In SQLspeak : you'd need an FK from your ATOM_CONTAINS_PARTICLE table to this EXISTING_PARTICLES table.
In a sense, the declaration of this rule is like your
"2. Atoms are composed of protons, neutrons, and electrons."
But note that you won't have a table in your own database that says this kind of thing. Instead, by declaring the FK to the system, this particular statement will be made in the catalog.
You need to make proper distinction between the type of statements that directly state things that are within the domain of interest (those end up being entities/classes/... in your conceptual models and most likely tables in your database) and the type of statements that state things about the domain of interest (like your FK rule).
(In use cases where the domain itself is highly abstract, the line between the two may be extremely thin or even nonexistant.)
Preliminary
Good question, very thoughtful for a learner. I think what you are really after is a discussion, in order to obtain clarity, and this is a Data Modelling exercise.
I understand your progression up to and including 3.3. What, how, do you get 3.4 (after the step-wise progression to 3.3) ? To me, Combine all the above does not equal Generic.
Rather than following your progression, and erecting a model for each step, let me respond with a TRD for the relevant steps, per your discussion.
TRD Only Tables, which are identified by Keys, and Relationships are relevant at this stage, I think you are well aware of the Attributes, if any, and which Keys they would be deployed with. After you achieve a stable TRD, then you can expand it to full DM.
After erecting a model than it a progression from the previous one, and upon evaluation, if it is clear that it loses information, it can be safely discarded. There is a value is contemplating such models, so the step is not incorrect. But the continued discussion of it is a waste. I believe I demonstrated that in the previous question.
Consider this set of Table Relation Diagrams.
1.x
From my perspective, A First is would be the first reasonable TRD that is worth contemplation.
I don't see how or why Proton/Neutron/Electron are Independent tables. They do not exist on their own, their weights; etc are fixed. They exist only in the context of an Atom.
Since every Atom comprises at least one Proton/Neutron/Electron, the Proton/Neutron/Electron columns can be deployed in Atom. Not drawn. Later.
2.x
Your progression is fine, except for one glaring error.
No. Common attributes about particle go in Particle. Attributes that are specific to the relationship (ie. not common) go in ParticleComposition. And then there are no "special attributes about particle composition", no "special tables".
3.x
Consider B Subtype. Your [3.1] is mostly correct, except for:
I don't see how Particle has children such as Proton/Neutron/Electron. Only an Atom has that.
I don't see how Particles are related to other Particles (ie. what is that ?). For the data discussed, a Molecule is made up of Atoms; An Atom is made up of Proton/Neutron/Electron; and a Particle is either a Molecule xor an Atom (Exclusive Subtype).
Please correct me if that is not correct.
Refer to Subtype Document for full details on that subject.
That can be C Reduced, as you have stated. This holds the notion that Proton/Neutron/Electron information is fixed per Atom: that there is one entry for each. Eg. each shell/energy level is not differentiated; zero is acceptable for Neutrons (instead of Null).
3.3
If it were fully D Normalised: the Atom always has at least a Proton entry; the Neutron entry is optional; and each Shell/energy level is differentiated.
Note the difference in the Predicates.
Note that although Reduction is a valid technique, it does not equate to Normalisation.
3.4
That appears to be the sum total of everything, laid out flat, or a flattened view (derived relations, a perspective, the result set). As such it is fine, for understanding. But if you proposed that as a set of tables, then it is horribly incorrect, due to various Normalisation errors. Which, if corrected, would progress to [3.3] and my [D Normalised].
Question
All of them except [3.3] break a number of rules. Mostly they are in the category of Normalisation errors. There would be associated identification errors, if you were to have given a full model, or CREATE TABLE statements.
But that does not matter if the context is data modelling exercise, for understanding. If this exercise was serious, then the paragraph above stands.
This section is presented in accordance with the SO guidelines, specifically: correct misinformation whenever you see it. I did comment on the subject post, but they keep disappearing. Thus I have placed it here.
That is one of the rules, yes, but the enclosing statement is patently false.
First, there are many essential or first-order rules in the Relational Model. From memory, I would say about forty.
Second, there are many second-order rules, ones that are logically implied by the first-order rules.
People who have technical qualifications and experience, who can understand the RM, and who follow the spirit and intent, follow all of them.
Others may not recognise some of the first-order rules, or most of the implied rules.
And there are, as evidenced from the books that allege to be about the RM, yet others, people who actively subvert and diminish the RM. They ignore the second-order rules, and worse, they use pharisaic "logic" to undermine the first-order rules.
Here, Erwin, who is well-known for his efforts regarding the RM on comp.databases.theory and TTM, reduces the RM to one pithy rule, and thus undermines the full set of rules, and the RM itself. Specifically in answer to your question, which if not for my response, would lead readers to believe that the RM is what he makes it out to be: just one rule, that everything, relational as well as non-relation, "satisfies".
The Relational Model is freely available, you can read it for yourself. Let me know if you would like a copy. The caveat is, the terminology is out-dated, and needs to be explained.
Second, even if one were to boil it down to one rule (impossible, too reductionist) or the most important rule (possible, but demeaning), that rule would not be it. That is one of the forty or so first-order rules, but certainly not ranked close to the top.
However, I grant that other people may have a different ranking, to suit their own purposes.
What people who understand the RM do discuss, as the main difference (not rule) between the RM and its predecessors, is this:
It was the first to have a complete mathematical definition (which forms its basis, and everything in it flows from that).
Whereas the predecessors related records using physical Record IDs, the RM demands (a) Logical Keys, made up from the data, and (b) relating rows (not records) by those Logical Keys.
It must be mentioned, that is the basis upon which systems that are characterised by Record IDs in every file, declared as "primary key", are completely non-relational, a regression back to pre-1970 ISAM Record Filing Systems, the very thing that the RM made obsolete. Notice also, how those primitive systems can be made to appear "relational", because by schizophrenic "logic", they "satisfy" the one quoted rule. Honest logic destroys such nonsense.
Such Record ID based systems have become the noram in the lower end of the industry precisely due to misinformation. Hence my willingness to correct it.
End misinformation correction section.
Formal Data Modelling, including Relational Normalisation. The method, the science, the principle, not the fragments of NF definitions.
I do not perceive the proposals to be different approaches, rather it is laying out all your thoughts in one single modelling exercise. And the point where the model starts to take a serious, feasible shape is [3.3].
Of course. Your marriage will succeed or fail based on the perception you have about your wife, because that perception is the seat of all your actions. The model will succeed or fail based on your perception of the data.
One of the great things about the Relational Model is that it teaches us to view (perceive, think about, model) the data, as data, and nothing but data. For one thing, that forms the Logical Key concept.
The first answer is, no, it should not depend on requirements. It should consider the data, the scope of which is limited to the enterprise (requirement, yes, but not the functional requirement), and only the data.
And of course, for reasons that I have detailed elsewhere, the data model should match the real world, it should not be limited the the functional need agaist the data.
The massive error, the common reason, for failure in the OO/ORM model, is that it perceives the data from the tiny lens of the OO/ORM model. It fails to separate Data vs Process, and it treats data as a mere "persistence" slave of the objects. There are many other errors in that model, which I will not enumerate here, the point is, they start from the position of the requirement, and ignore the data.
The second answer is, a project does not get commissioned until the requirement is set, the reality if funds are requirement-based. So the mature project leader makes sure that the requirement contains enough justification to analyse and model the data, as data, separate to the functions.
You could, but that will cost an awful lot. The mature sequence is to get the data model right, as early as possible.
If the data model matches the real world, when changes and additions come up, it is easy to extend. Conversely, if the data model was the minimum for the functional requirement, or if it does not match the real world, then changes will be difficult and costly.
Of course.
That would be a horrible error. Never place something in a container that does not match the label. Never place two different things in one container (which has one label). The correct method would be to use a common identifier Name (which is Atom- or Molecule- or Particle-name), and to use Subtypes.
Only if there is one. And if there is not, that stands as a sign that the entities are not the same, that a generic model cannot be used.
Well, the idea is to choose data items that are stable (not static) to form the key. And yes, Key design is an important aspect of the modelling exercise. If you follow the Relational Model, the Keys form the logical structure of the database. Domain is very important (I think you realise that). And yes, it is costly to change.
Which brings us back to the main point. That is precisely the reason why the Keys have to be modelled and chosen correctly, for each table, as well as for all its children.
Update 1 & 2
I noticed your two Updates just now. This is not a full response, just a very short one for now.
But now, given your hierarchy diagrams, and sample data (thank you), I realise that what I thought you meant, and what you meant, are two different things. Consider the Updated TRD & Hierarchy:
Your Particle is the set of Molecules plus the set of Atoms plus the set of subatomic particles.
That is incorrect
There is an hierarchy, yes, but thus far, it exists in the sequence of tables, not as an hierarchy within one table.
Stated otherwise, the two sets (Atoms, Molecules) are discrete, each has their own set of components, which are different. There is no set that includes everything (except the theoretical universal set).
The updated Table Relation model is E Normalised • Update 2. The Subtypes have been removed, along with Particle. It supplies all the requirements stated in Update 2. Note the updated Predicates.
Your hierarchy diagram is incorrect.
Your error, is that you have combined the hierarchy of Classifiers (the structure, the container) with the data (the instances of Classifiers; the content). You can't do that. You need two separate diagrams, one for the container, and a second for the content.
This is a typical error of the OO/ORM mindset. Failure to observe the Scientific Principle to separate Data vs Process. The exact same error detailed in my Response to Hidders, in the previous question. Results is complex objects, that never actually work.
So your hierarchy diagram is illegal, it is two completely different diagrams combined into one.
F Hierarchy (Classification) depicts that, and only that.
G Hierarchy (Sample Data) illustrates that, and only that.
There is a difference in style between the way you depict hierarchies (Organisation Chart) and the way I depict them (Explorer). One ends up being very wide, the other is more compact. I think you can figure it out
You had some clarity at the end of the previous question. The novel notion of Type in that poisonous book has got you completely confused. This problem, these issues, have nothing to do with Type.
More words are called for, I will respond more fully as time permits.
I like Fowler's treatment of Class Table Inheritance and Single Table Inheritance. You have touched on both of these designs here. Each of them has its uses and its drawbacks. You can use these as search terms and you'll get a lot to read. Some of it is worthwhile. There are even a couple of tags here in SO with those names.
I'm not sure about today, but subtypes were often omitted from Database 101 courses back some 20 years ago, and it was something that every database builder ran into as soon as they got into "the real world".