Is using an IN over a huge data set a good idea?

2019-08-01 07:36发布

问题:

Let's say I have a query of the form:

SELECT a, b, c, d 
FROM table1 
WHERE a IN (
  SELECT x 
  FROM table2 
  WHERE some_condition);

Now the query for the IN can return a huge number of records. Assuming that a is the primary key, so an index is used is this the best way to write such a query?

Or it is more optimal to loop over each of the records returned by the subquery?

For me it is clear that when I do a where a = X it is clear that I just do an index (tree) traversal.
But I am not sure how an IN (especially over a huge data set) would traverse/utilize an index.

回答1:

The MySQL optimizer isn't really ready (jet) to handle this correctly you should rewrite this kind of query to a iNNER JOIN and index correctly this will be the fasted method assuming t1.a and t2.x are unique

something like this.

SELECT 
a
, b
, c
, d
FROM 
  table1 as t1
INNER JOIN
  table2 as t2
ON t1.a = t2.x
WHERE 
 t1.some_condition .... 

And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes



回答2:

Having 1 query instead of loop will be definitely more efficient (and by nature consistent , to get consistent results with loop in general you will have to use serializable transaction ). One can argue in favour of EXISTS vs IN; as far as I remember mysql generates (or at least it was true for up to 5.1)...

Efficiency of utilizing index on a depends on number and order subquery result (assuming optimizer choses to grab results from subquery first and then compare it with a). In my understanding, the fastest option is to perform merge join which requires both resultsets sorted by the same key; however it may not be possible due to different sort order. Then I guess it's optimizer decision whether to sort or to use loop join. You can rely on its choice or try using hints and see if it makes a difference.