I have three ways to construct my query:
The first one:
select obj from table1 where condition1 and obj in (
select obj from table2 where condition2 and obj in (
select obj from table3 where condition3 and obj in (
...
)))
The second one:
select obj from table1 where condition1
and obj in (select obj from table2 where condition2)
and obj in (select obj from table3 where condition3)
...
The third one:
select table1.obj from table1
inner join table2 on table2.obj = table1.obj and table2.condition='condition2'
inner join table3 on table3.obj = table2.obj and table3.condition='condition3'
...
where table1.condition='condition1'
My questions is if these queries provide the same result and if these query are equally optimal.
I am pretty sure that the first two queries produce the same output but the second query is faster. Am not sure about the third query.
ADDED
There is also another option:
select table1.obj from table1
inner join table2 on table2.obj = table1.obj
inner join table3 on table3.obj = table2.obj
...
where
table1.condition='condition1' and
table2.condition='condition2' and
table3.condition='condition3'
Well, first, those queries don't necessarily return the same results :
The IN() removes duplicates. So, if table2 contains 50 values of x satisfying condition y=foo, query 2 will return 50x more rows than query 1. That may be what you want, or not. If x is UNIQUE in both tables, the queries will give the same results.
This is of course wrong, since the subqueries are not dependent subqueries. Anyway, it will execute the subquery once, and hash it or sort it to eliminate duplicates (as required by IN()) then use the resulting list to perform an IN-join. MySQL until recent versions did re-execute the subquery for each row, this is no longer the case. Since MySQL doesn't do hash IN joins though, it might still be much slower than a JOIN.
Although there are always exceptions, Option 3 is almost surely the best/first choice. Depending on your indexes and data distributions, the MySQL query execution planner will handle which order to pull from the tables.
In the other cases, subqueries (Options 1 and 2) are executed for every row of the outer query -- they can be terribly inefficient. So, following the previous statement, nested subqueries (Option 1), can be exponentially worse than using first order subqueries (Option 2) or normal joins (Option 3).
Note that for
INNER JOIN
s, it doesn't matter with respect to performance or functionality if the extra conditions are in theJOIN
clauses or in theWHERE
clauses. Consequently, your other option is effectively equivalent to the Option 3.I would suspect the third query would be fastest. SQL is optimized to make JOINs a fast operation.
But the only way to know with YOUR data is to try them and see.
Basicaly the way of execute a query is the following : FROM (get the whole data) => WHERE (apply constraints) =>SELECT (display results)
JOIN clause are always the best choice for joining data because in the WHERE clause, only data that matches the JOIN clauses are tested.
In the FROM, only field on the ON clause are selected and tested before getting the whole data for the WHERE and the SELECT.
In your both first examples, for each SELECT the whole content of tables is selected for the tests of the WHERE clause. It is almost the same problem for joining in the WHERE clause.
The both last examples seems to be the same. I would prefer the last one because the use of the ON clause in the third example is useful for RIGHT or LEFT JOIN but in this case, it just brings mess in the clauses'use : FROM : get the accurate resource, WHERE : apply constraints.
This explanation is very schematic but I hope it makes sense ...