SQL join following foreign key: statically check t

2019-01-19 09:16发布

Often you join two tables following their foreign key, so that the row in the RHS table will always be found. Adding the join does not affect the number of rows affected by the query. For example

create table a (x int not null primary key)
create table b (x int not null primary key, y int not null)
alter table a add foreign key (x) references b (x)

Now, assuming you set up some data in these two tables, you can get a certain number of rows from a:

select x from a

Adding a join to b following the foreign key does not change this:

select a.x from a join b on a.x = b.x

However, that is not true of joins in general, which may filter out some rows or (by Cartesian product) add more:

select a.x from a join b on a.x = b.x and b.y != 42 -- probably gives fewer rows

select a.x from a join b on a.x != b.y -- probably gives more rows

When reading SQL code there is no obvious way to tell whether a join is the key-preserving kind, which may add extra columns but does not change the number of rows returned, or whether it has other effects. Over time I have developed a coding convention which I mostly stick to:

  • if a key-preserving join, use join
  • if wanting to filter rows, put the filter condition in the where clause
  • if wanting more rows, sometimes cross join for Cartesian product is the clearest way

These are usually just style issues, since you can often put a predicate into either the join clause or the where clause, for example.

My question

Is there some way to have these key-preserving joins statically checked by the database server when the query is compiled? I understand that the query optimizer already knows that a join on a foreign key will always find exactly one row in the table pointed to by the foreign key. But I would like to tag it in my SQL code for the benefit of human readers. For example, suppose the new syntax fkjoin is used for a join following a foreign key. Then the following SQL fragments will give errors or not:

a fkjoin b on a.x = b.x -- OK

a fkjoin b on a.x = b.x and b.y = 42 -- "Error, join can fail due to extra predicate"

a fkjoin b on a.x = b.y -- "Error, no foreign key from a.x to b.y"

This would be a useful check for me when writing the SQL, and also when returning to read it later. I understand and accept that changing the foreign keys in the database would change what SQL is legal under this scheme - to me, that is a desired outcome, since if a necessary FK ceases to exist then the key-preserving semantics of the query are no longer guaranteed, and I'd like to find out about it.

Potentially, there could be some external SQL static checker tool that does the work, and special comment syntax could be used rather than a new keyword. The checker tool would need access to the database schema to see what foreign keys exist, but it would not need to actually execute the query.

Is there something that does what I want? I am using MSSQL 2008 R2. (Microsoft SQL Server for the pedantic)

1条回答
Melony?
2楼-- · 2019-01-19 09:27

I realize that you are interested in indicating whether a particular join on particular columns is on a FK, or is a restriction, or perhaps is of some other case, or none of the preceding. (And it's not clear what you mean by "success" or "failure" of a join, or its relevance.) Whereas focusing on that information, as explained below, is to miss focusing on more important and fundamental things.

A base table has a "meaning" or "predicate (expression)" that is a fill-in-the-(named-)blanks statement given by the DBA. The names of the blanks of the statement are the columns of the table. Rows that fill in the blanks to make a true proposition about the world go in the table. Rows that fill in the blanks to make a false proposition about the world are left out. Ie a table holds the rows that satisfy its statement. You cannot set a base table to a certain value without knowing its statement, observing the world and putting the appropriate rows into the table. You cannot know about the world from base tables except by knowing its statement and taking present-row propositions to be true and absent-row propositions to be false. Ie you need its statement to use the database.

Notice that the typical syntax for a table declaration looks like a shorthand for its statement:

-- employee [eid] is named [name] and lives at [address] in ...
EMPLOYEE(eid,name,address,...)

You can make bigger statements by putting logic operators AND, OR, AND NOT, EXISTS name, AND condition, etc between/around other statements. If you translate a statement to a relation/SQL expression by converting

  • a table's statement to its name
  • AND to JOIN
  • OR to UNION
  • AND NOT to EXCEPT/MINUS
  • EXISTS C,... [...] to SELECT all columns but C,... FROM ...
  • AND condition to ON/WHERE condition
  • IMPLIES to SUBSETOF
  • IFF to =

then you get a relation expression that calculates the rows that make the statement true. (Arguments of UNION & EXCEPT/MINUS need the same columns.) So just as every table holds the rows satisfying its statement, a query expression holds the rows that satisfy its statement. You cannot know about the world from a query result except by knowing its statement and taking its present-row propositions to be true and absent-row propositions to be false. Ie you need its statement to compose or interpret a query. (Observe that this is true regardless of what constraints hold.)

This is the foundation of the relational model: table expressions calculate rows satisfying corresponding statements. (To the extent that SQL differs, it is literally illogical.)

Eg: If table T holds the rows that make statement T(...,T.Ci,...) true and table U holds the rows that make statement U(...,U.Cj,...) true then table T JOIN U holds the rows that make statement T(...,T.Ci,...) AND U(...,U.Cj,...) true. That is the semantics of JOIN that is important to using a database. You can always join, and a join always has a meaning, and it is always the AND of its operands' meanings. Whether any tables happen to have FKs to others just isn't particularly helpful for reasoning about updates or queries. (The DBMS uses constraints for when you make mistakes.)

A constraint expression just corresponds to a proposition aka always-true statement about the world and simultaneusly to one about base tables. Eg for C UNIQUE NOT NULL in U, the following three expressions are equivalent to each other:

  • FOREIGN KEY T (C) REFERENCES U (C)
  •     EXISTS columns other than C T(...,C,...)
    IMPLIES EXISTS columns other than C U(...,C,...)
  • (SELECT C FROM T) SUBSETOF (SELECT C FROM U)

It is true that this implies that SELECT C FROM T JOIN U ON T.C = U.C = SELECT C FROM U, ie a join on a FK returns the same number of rows. But so what? The join's meaning is still the same function of its arguments'.

Whether a particular join on a particular column set involves a foreign key is just not germane to understanding the meaning of a query.

查看更多
登录 后发表回答