Efficient way to select records missing in another

Posted 2020-04-14 00:50

Question:

I have 3 tables. Below is the structure:

  • student (id int, name varchar(20))
  • course (course_id int, subject varchar(10))
  • student_course (st_id int, course_id int) -> records which students enrolled in which course

Now I want to write a query to find the students who did not enroll in any course. As far as I can tell, there are multiple ways of fetching this information. Could you please let me know which of these is the most efficient, and why? Also, if there is a better way of doing the same thing, please let me know.

db2 => select distinct name from student inner join student_course on id not in (select st_id from student_course)

db2 => select name from student minus (select name from student inner join student_course on id=st_id)

db2 => select name from student where id not in (select st_id from student_course)

Thanks in advance!!

Answer 1:

The subqueries you use, whether NOT IN, MINUS or anything else, are generally inefficient. A common way to do this is a left join:

select s.name
from student s
left join student_course sc on s.id = sc.st_id
where sc.st_id is null

Using a join is the "normal" and preferred solution.



Answer 2:

The canonical (maybe even synoptic) idiom is (IMHO) to use NOT EXISTS:

SELECT *
FROM student st
WHERE NOT EXISTS (
  SELECT *
  FROM student_course nx
  WHERE st.id = nx.st_id
  );

Advantages:

  • NOT EXISTS(...) is very old, and most optimisers will know how to handle it; thus it will probably be present on all platforms
  • the nx. correlation name is not leaked into the outer query: the select * in the outer query will only yield fields from the student table, and not the (null) rows from the student_course table, like in the LEFT JOIN ... WHERE ... IS NULL case. This is especially useful in queries with a large number of range table entries.
  • (NOT) IN is error-prone (NULLs), and it might perform badly on some implementations (duplicates and NULLs have to be removed from the result of the uncorrelated subquery)
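
The NULL problem mentioned in the last point can be reproduced with a small sketch. The tables follow the question's schema, the sample rows are invented for illustration, and the syntax is plain SQL that may need minor adjustment for a particular engine.

```sql
-- Scratch tables following the question's schema; the rows are invented.
CREATE TABLE student (id INT NOT NULL, name VARCHAR(20));
CREATE TABLE student_course (st_id INT, course_id INT);

INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
-- One enrollment row has a NULL st_id.
INSERT INTO student_course VALUES (1, 10), (NULL, 20);

-- NOT IN: returns no rows at all. For Bob and Cy the comparison
-- against the NULL is unknown, so the predicate is never true.
SELECT name
FROM student
WHERE id NOT IN (SELECT st_id FROM student_course);

-- NOT EXISTS: returns Bob and Cy, the students with no matching row.
SELECT st.name
FROM student st
WHERE NOT EXISTS (
  SELECT 1
  FROM student_course nx
  WHERE nx.st_id = st.id
);
```

If st_id is declared NOT NULL, the two queries return the same rows, and the remaining difference is only in how a given optimiser plans them.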


Answer 3:

Using "not in" is generally slow. That makes your second query the most efficient. You probably don't need the brackets though.



Answer 4:

Just as a comment: I would suggest selecting the student ids (which are unique) rather than the names.
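
A small sketch of why this matters for the MINUS query from the question: MINUS compares whole rows, so subtracting by name can drop an un-enrolled student who shares a name with an enrolled one. The sample rows are invented, and EXCEPT is the standard spelling on engines that do not accept MINUS.

```sql
-- Scratch tables following the question's schema; the rows are invented.
CREATE TABLE student (id INT NOT NULL, name VARCHAR(20));
CREATE TABLE student_course (st_id INT, course_id INT);

-- Two different students share the name 'Ann'; only id 1 is enrolled.
INSERT INTO student VALUES (1, 'Ann'), (2, 'Ann'), (3, 'Bob');
INSERT INTO student_course VALUES (1, 10);

-- By name: yields only Bob. The un-enrolled Ann (id 2) is removed
-- because the enrolled Ann (id 1) has the same name.
SELECT name FROM student
MINUS
SELECT name FROM student INNER JOIN student_course ON id = st_id;

-- By id: yields 2 and 3, the students with no enrollment row.
SELECT id FROM student
MINUS
SELECT st_id FROM student_course;
```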

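As another query option, you might want to left join the two tables, group by the student id, and keep only the groups having count(course_id) = 0. (It has to be an outer join: with an inner join, students without courses never appear, so no group can have a count of 0.)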
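
A minimal sketch of that grouping approach, assuming the tables from the question already exist. The aliases s and sc are introduced here for readability.

```sql
-- COUNT(sc.course_id) skips the NULLs produced by the outer join,
-- so students with no enrollment rows end up with a count of 0.
SELECT s.id, s.name
FROM student s
LEFT JOIN student_course sc ON sc.st_id = s.id
GROUP BY s.id, s.name
HAVING COUNT(sc.course_id) = 0;
```

Note that COUNT(*) would not work here, since it counts the single NULL-extended row as 1.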

Also, I agree that indexes will be more important.
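
As a rough sketch of what such indexes might look like for the question's schema: the index names are hypothetical, and the unique index assumes student ids really are unique and that no equivalent index or primary key exists yet.

```sql
-- Hypothetical index names; adjust to local naming conventions.
-- Lets the engine look up a student by id without scanning the table.
CREATE UNIQUE INDEX student_id_ux ON student (id);

-- Lets the anti-join / NOT EXISTS probe find enrollment rows by student.
CREATE INDEX student_course_st_id_ix ON student_course (st_id);
```

Whether the optimiser actually uses them depends on table sizes and statistics, so comparing the access plans of the candidate queries before and after is the reliable check.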