Modeling multiple many to many relationships betwe

2019-05-14 07:59发布

问题:

When two tables have a many-to-many relationship or when a table has a many-to-many relationship with itself we can model that with a junction table.

However the type of the relationship can expend beyond the original type, for example:

Requirements: we have users. each user can have 0 or more other users as friends

Solution: a users table and a junction table called "user_user".

Then we discover a new requirement: some friendships are romantic or in other words two users can be connected in a different way than friendship.

  • Solution a: add a column to the junction table that contains the type of friendship (friend | partner | ex-friend) and so on.

  • Solution b: rename the 'user_user' table to 'friendships' (that would have been a better name to start with with this approach) and create a new table called romances, also connecting a user to a user.

Then we discover yet another new requirement: users can owe other users money. Again we can piggyback on the same junction table or create a new one called 'debts'. This time my intuition is 100% to have a separate table.

Every time we discover a new type of relationship or a sub-type of a know relationship we can add a column to the one junction table or create a new junction table.

My question is, what is a good rule to decide when a new table must be created?

Is it when we must have more then a single row per ordered pair? For example if past relationships are never deleted so if two users have been friends in the past for two years, lost contact, and then re-befriended we want to have the old row as well with a start and end date, but having two rows makes the other column information duplicated (old row shows a user owes money, the new one doesn't).

Is it when the extra column is not logically a definition of a single type of relationship?

OK:start of friendship date is related directly to the friendship 1 to 1

Not OK: column for how much money one user owns another user can be 1 to 1 for the relationship but logically is not a description of the friendship.

If we know in advance that two tables are going to have a lot of many to many relationship, is it a reason to plan for many junctions tables or to make a single junction table that is more flexible (even if the type of the connection is not arbitrary)?

回答1:

It's up to you. You can define a "friendship" as distinct from "romantically involved" and so need two junction tables. Or you can define "romantically involved" as a special case of "friendship" and so need one table. I would lean toward the latter as simpler and less confusing. (What if the same two people are listed as friends and also as lovers?)

However, "owes money to" is different. If A is friends with B, then B is friends with A. But if A owes money to B, it is false to say that B owes money to A. So not only are you defining a different kind of relationship, but the positioning of which field contains the reference to A and which contains the reference to B becomes significant. For this then, it would be best to have a separate junction table.



回答2:

TL;DR Tables represent application/business relationships/associations. As in the Entity-Relationship Model. Ie relations. As in the Relational Model. Every time we are interested in a distinct relationship, we consider a new ("junction"/"association") table. Normalization tells us when and how to decompose a relationship/table into others or combine some into one.


We identify a relationship by some predicate, ie statement template, that a row makes into a proposition, ie statement, that says something we are interested in about our application. The rows that make true statement from a table's predicate go in the table.

Employee(e, n, ...) -- employee identified by E is named N and ...
Manages(e, m) -- employee M manages employee E

You cannot know what a table says about the application or set it according to the application without knowing its predicate! When you state the "cardinality" of a "relationship" you have to have identified what relationship/association/predicate you are talking about first.

(Unfortunately "relationship" gets used by many so-called ER methods & products to mean foreign key of a table.)

We can combine multiple tables into one or replace a table by others when we can reconstruct the original(s) from the new. Here we could combine the tables above into

-- employee identified by E is named N and ... AND employee M manages employee E
EmployeeM(e, n, ..., m)

That happens to always be the JOIN of Employee & Manages, because the predicate of a JOIN is the AND of the predicates of its arguments. But because E is the set of common columns & it is unique in one of them, we can reconstruct the originals via projections of the new.

That would be a reasonable design when an employee can only have exacty one manager. But if an employee could have one or more managers then that table, although it would still contain the rows that satisfy that predicate and allow reconstructing the originals, would exhibit update anomalies, so we would prefer the originals. And if an employee could have zero or one or zero or more managers then we could still use that table, but not only that table, because it can't tell us about employees with no manager(s). (Which the original design does.) We'd also need a table like Employee but with the different predicate employee identified by E is named N and ... AND E has no manager. Or instead of those two tables we could use just a table like EmployeeM but that allows nulls via the different predicate employee identified by E is named N and ... AND ( employee M manages employee E OR M IS NULL AND E has no manager ).

Read about the original Chen ER model/method, where there are entity types, icons & tables and relationship types, icons & tables. Every relationship type is clearly diagrammed as a diamond. Every line is clearly representing the participation by an entity type (possibly an associative entity type) in a relationship type, ie is clearly representing a FK between tables. Whereas many so-called ER methods do not even make distinctions between entities and relationships. (This doesn't need to be good or bad, but their presentations tend not to explain how to model properly.)

The relational model doesn't care about entities and relationships per se. Its tables are just relationships on values. ER and so-called ER methods make an unnecessary and limiting arbitrary distinction between entities and relationships. Naturally, a table can still say what it does about the application as in an ER model because subrows of values can identify entities. See this answer.

Normalization tells you when and how to decompose a relationship's predicate/table into smaller ones (that are projections of it, that AND/JOIN back to it), and it tells you when & how to combine multiple relationships' predicates/tables into one (via AND/JOIN, using projection to reconstruct).

PS There is nothing wrong with duplicated values per se. What can be a problem is when multiple rows imply the same thing about the application by being in or not in tables according to your choice of predicates. Not all such redundancy is wrong; it's just uncontrolled redundancy that is wrong. In general we want to have designs that are as much in 5NF as possible. (Ie reducing update anomalies, predicate complexity and constraints.)