Inner Join vs Natural Join vs USING clause: are th

2019-01-08 01:10发布

问题:

Imagine I have two simple tables, such as:

CREATE TABLE departments(dept INT PRIMARY KEY, name);
CREATE TABLE employees(id PRIMARY KEY, fname, gname,
    dept INT REFERENCES departments(dept));

(simplified, of course).

I could have any of the following statements:

SELECT * FROM employees e INNER JOIN departments d ON e.dept=d.dept;
SELECT * FROM employees e NATURAL JOIN departments d;
SELECT * FROM employees e JOIN departments d USING(dept);

A working example can be found here: SQL Fiddle: http://sqlfiddle.com/#!15/864a5/13/10

They all give nearly the same results — certainly the same rows.

I have always preferred the first form because of its flexibility, readability and predictability — you clearly define what is connected to what.

Now, apart from the fact that the first form has a duplicated column, is there a real advantage to the other two forms? Or are they just syntactic sugar?

I can see the disadvantage in the latter forms is that you are expected to have named your primary and foreign keys the same, which is not always practical.

回答1:

Now, apart from the fact that the first form has a duplicated column, is there a real advantage to the other two forms? Or are they just syntactic sugar?

TL;DR NATURAL JOIN is used in a certain style of relational programming that is simpler than the usual SQL style. (Although when embedded in SQL it is burdened with the rest of SQL query syntax.) That's because 1. it directly uses the simple operators of predicate logic, the language of precision in engineering (including software engineering), science (including computer science) and mathematics, and moreover 2. simultaneously and alternatively it directly uses the simple operators of relational algebra.

The common complaint about NATURAL JOIN is that since shared columns aren't explicit, after a schema change inappropriate column pairing may occur. And that may be the case in a particular development environment. But in that case there was a requirement that only certain columns be joined and NATURAL JOIN without PROJECT was not appropriate. So these arguments assume that NATURAL JOIN is being used inappropriately. Moreover the arguers aren't even aware that they are ignoring requirements. Such complaints are specious. (Moreover, sound software engineering design principles lead to not having interfaces with such specificiatons.)

Another related misconceived specious complaint from the same camp is that "NATURAL JOIN does not even take foreign key relationships into account". But any join is there because of the table meanings, not the constraints. Constraints are not needed to query. If a constraint is added then a query remains correct. If a constraint is dropped then a query relying on it becomes wrong and must be changed to a phrasing that doesn't rely on it that wouldn't have had to change. This has nothing to do with NATURAL JOIN.


You have described the difference in effect: just one copy of each common column is returned.

From Is there any rule of thumb to construct SQL query from a human-readable description?:

It turns out that natural language expressions and logical expressions and relational algebra expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way.

Eg from Codd 1970:

The relation depicted is called component. [...] The meaning of component(x, y,z) is that part x is an immediate component (or subassembly) of part y, and z units of part x are needed to assemble one unit of part y.

From this answer:

Every base table has a statement template, aka predicate, parameterized by column names, by which we put a row in or leave it out.

Plugging a row into a predicate gives a statement aka proposition. The rows that make a true proposition go in a table and the rows that make a false proposition stay out. (So a table states the proposition of each present row and states NOT the proposition of each absent row.)

But every table expression value has a predicate per its expression. The relational model is designed so that if tables T and U hold rows where T(...) and U(...) (respectively) then:

  • T NATURAL JOIN U holds rows where T(...) AND U(...)
  • T WHEREcondition holds rows where T(...) AND condition
  • T UNION CORRESPONDING U holds rows where T(...) OR U(...)
  • T EXCEPT CORRESPONDING U holds rows where T(...) AND NOT U(...)
  • SELECT DISTINCTcolumns to keepFROM T holds rows where
    THERE EXISTS columns to drop SUCH THAT T(...)
  • etc

Whereas reasoning about SQL otherwise is... not "natural":

An SQL SELECT statement can be thought of algebraically as 1. implicitly RENAMEing each column C of a table with (possibly implicit) correlation name T to T.C, then 2. CROSS JOINing, then 3. RESTRICTing per INNER ON, then 4. RESTRICTing per WHERE, then 5. PROJECTing per SELECT, then 6. RENAMEing per SELECT, dropping T.s, then 7. implicitly RENAMEing to drop remaining T.s Between the T.-RENAMEings algebra operators can also be thought of as logic operators and table names as their predicates: T JOIN ... vs Employee T.EMPLOYEE has name T.NAME ... AND .... But conceptually inside a SELECT statement is a double-RENAME-inducing CROSS JOIN table with T.Cs for column names while outside tables have Cs for column names.

Alternatively an SQL SELECT statement can be thought of logically as 1. introducing FORSOME T IN E around the entire statement per correlation name T and base name or subquery E, then 2. referring to the value of quantified T by using T.C to refer to its C part, then 3. building result rows from T.Cs per FROM etc, then 4. naming the result row columns per the SELECT clause, then 4. leaving the scope of the FORSOMEs. Again the algebra operators are being thought of as logic operators and table names as their predicates. Again though, this conceptually has T.C inside SELECTs but C outside with correlation names coming and going.

These two SQL interpretations are nowhere near as straightforward as just using JOIN or AND, etc, interchangeably. (You don't have to agree that it's simpler, but that perception is why NATURAL JOIN and UNION/EXCEPT CORRESPONDING are there.) (Arguments criticizing this style outside the context of its intended use are specious.)

USING is a kind of middle ground orphan with one foot in the NATURAL JOIN camp and one in the CROSS JOIN. It has no real role in the former because there are no duplicate column names there. In the latter it is more or less just abbreviating JOIN conditions and SELECT clauses.

I can see the disadvantage in the latter forms is that you are expected to have named your primary and foreign keys the same, which is not always practical.

PKs (primary keys), FKs (foreign keys) & other constraints are not needed for querying. (Knowing a column is a function of others allows scalar subqueries, but you can always phrase without.) Moreover any two tables can be meaningfully joined. If you need two columns to have the same name with NATURAL JOIN you rename via SELECT AS.



回答2:

Typing JOIN performs an INNER JOIN by default. So:

SELECT * FROM employees e INNER JOIN departments d USING(dept);

is equivalent to

SELECT * FROM employees e JOIN departments d USING(dept);

and you will have only one dept column in the result.

In the same way

SELECT * FROM employees e INNER JOIN departments d ON e.dept=d.dept;

is equivalent to

SELECT * FROM employees e JOIN departments d ON e.dept=d.dept;

but you will have a duplicate dept columns in the result.

INNER JOIN is easier to read, especially if your query have other join types (LEFT or RIGHT or ..) included in it.

A NATURAL JOIN assumes the same-named columns in both table match. So you cannot do a NATURAL JOIN if for instance in employees table your join column is named "department" and in your departments table your join column is named "dept"



回答3:

A major advantage of NATURAL JOIN is it is the only SQL join that doesn't generate 'duplicate columns'.

If all join types other than NATURAL JOIN were removed from the SQL language, it would still be relationally complete. NATURAL JOIN is the only join type you need.

However, NATURAL JOIN wasn't introduced into SQL until SQL92. In the early days of SQL, the language designers chose other join types and as a consequence had to find a way of dealing with 'duplicate columns'.

Consider the result of

employees e CROSS JOIN departments d

involves two columns named dept because each table has a column names dept i.e. 'duplicate columns'.

The chosen solution to the 'duplicate columns' problem was range variables, e and d in the above example. [Most SQL folk call them 'table aliases', which is cnfusing because a range variable represents a row, rather than a table). The SQL standard calls them 'correlation names', though it is not clear what 'correlation' means in context. Chris Date and LINQ call them range variables and so do I!] When using the 'legacy' join types and omitting range variables, a range variable name equal to the table's name will be used (perhaps the root of the 'table aiias' fallacy?).

Using the OP's example:

SELECT * FROM employees e INNER JOIN departments d ON e.dept=d.dept;

the projection SELECT * will containt 'duplicate columns'. The names of the columns and their ordering in the result is one of those things the SQL standard doesn't attempt to address; there were already too many diverging implementations to attempt to standardize. In SQL Server I get duplicate names [sigh].

Probably for these reasons, most SQL folks say, "Never use SELECT *" etc. The usual approach is to specify every column and 'project away' any duplicates. [I think a SELECT ALL BUT <set of columns> feature would be nice but I think the SQL folk have won the day.]

When using NATURAL JOIN, there are no 'duplicate columns' to 'project away' and no join predicates, therefore range variables are not needed. So the OP's example should be rewritten as:

SELECT * FROM employees NATURAL JOIN departments;

Although it is true that if one exclusively uses NATURAL JOIN and shuns all other join types then range variables are no longer needed, SQL sometimes still requires them. Using @philipxy's scenario, "If you need two columns to have the same name with NATURAL JOIN you rename via SELECT AS" - if your SQL implementation lacks common table expressions and you are forced to do the rename in a derived table then SQL requires the derived table to be assigned a range variable even though you will never refer to it again!.

Lwt's say each table in the OP's example had a pointless guid column which one needs to 'project away' for NATURAL JOIN purposes:

SELECT *
  FROM ( SELECT dep, name, FROM employees ) e 
       NATURAL JOIN ( SELECT dept, fname, gname departments ) d;

In this case, without the range variables e and d the SQL would not be valid (you'd get a parse error). The SQL implementation doesn't need the range variables (and even if it did it could generate placeholders internally!) but the SQL language requires them because of SQL's 'shackles of compatibility' where no feature is ever removed or deprecated.



回答4:

From Oracle documentation:

A NATURAL JOIN is a JOIN operation that creates an implicit join clause for you based on the common columns in the two tables being joined. Common columns are columns that have the same name in both tables.

A NATURAL JOIN can be an INNER join, a LEFT OUTER join, or a RIGHT OUTER join. The default is INNER join.

The clause

TableA JOIN tableB USING(column)

is, as you noted, simply syntactic sugar for

TableA JOIN tableB ON tableA.column = tableB.column


回答5:

NATURAL JOIN is not as widely supported and neither is JOIN USING (i.e. not in SQL Server)

There are many arguments for NATURAL JOIN being a bad idea. Personally I think that not explicitly naming things such as joins is inviting disaster.

For example if you add a column in a table without realising that it happens to fit a 'natural join', you can have unexpected code failures when a natural join suddenly does something completely different. You'd think adding a column wouldn't break anything but it can break badly written views and a natural join.

When you're building a system you should never allow these kinds of risks to creep in. It's the same as creating views across multiple tables without a table alias on every column and using insert without a column list.

For those reasons, if you are just learning SQL now, get out of the habit of using these.



标签: sql join