Let's say I have a query of the form:
SELECT a, b, c, d
FROM table1
WHERE a IN (
SELECT x
FROM table2
WHERE some_condition);
Now the query for the IN
can return a huge number of records.
Assuming that a
is the primary key, so an index is used is this the best way to write such a query?
Or it is more optimal to loop over each of the records returned by the subquery?
For me it is clear that when I do a where a = X
it is clear that I just do an index (tree) traversal.
But I am not sure how an IN
(especially over a huge data set) would traverse/utilize an index.
The MySQL optimizer isn't really ready (jet) to handle this correctly you should rewrite this kind of query to a iNNER JOIN and index correctly this will be the fasted method assuming t1.a and t2.x are unique
something like this.
SELECT
a
, b
, c
, d
FROM
table1 as t1
INNER JOIN
table2 as t2
ON t1.a = t2.x
WHERE
t1.some_condition ....
And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes
Having 1 query instead of loop will be definitely more efficient (and by nature consistent , to get consistent results with loop in general you will have to use serializable
transaction ). One can argue in favour of EXISTS
vs IN
; as far as I remember mysql generates (or at least it was true for up to 5.1)...
Efficiency of utilizing index on a
depends on number and order subquery result (assuming optimizer choses to grab results from subquery first and then compare it with a
). In my understanding, the fastest option is to perform merge join which requires both resultsets sorted by the same key; however it may not be possible due to different sort order. Then I guess it's optimizer decision whether to sort or to use loop join. You can rely on its choice or try using hints and see if it makes a difference.