Which of these queries is the faster?
NOT EXISTS:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE NOT EXISTS (
SELECT 1
FROM Northwind..[Order Details] od
WHERE p.ProductId = od.ProductId)
Or NOT IN:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE p.ProductID NOT IN (
SELECT ProductID
FROM Northwind..[Order Details])
The query execution plan says they both do the same thing. If that is the case, which is the recommended form?
This is based on the NorthWind database.
[Edit]
Just found this helpful article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I think I'll stick with NOT EXISTS.
It depends.
An IN against a subquery would be relatively slow when there isn't much to limit the size of what the query has to check to see whether the key is in it. EXISTS would be preferable in this case.
But, depending on the DBMS's optimizer, this could be no different.
As an example of when EXISTS is better
They are very similar but not really the same.
In terms of efficiency, I've found the LEFT JOIN ... IS NULL form more efficient (when a large number of rows is to be selected, that is).
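For reference, the left join / is null form of the question's query might look like the sketch below. It assumes the same Northwind tables as the question. The filter works because a matched row always has a non-NULL od.ProductID (the join condition requires equality), so IS NULL can only hold for products with no order lines.

```sql
-- Sketch only: anti-join written as LEFT JOIN ... IS NULL,
-- using the table and column names from the question.
SELECT p.ProductID, p.ProductName
FROM Northwind..Products p
LEFT JOIN Northwind..[Order Details] od
    ON od.ProductID = p.ProductID   -- try to match each product to an order line
WHERE od.ProductID IS NULL;         -- keep only products with no match
```

Whether this beats NOT EXISTS depends on the optimizer and the data, so it is worth comparing the plans rather than assuming.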
I always default to NOT EXISTS.
The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allow NULLs the NOT IN will be treated identically to the following query.
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
If [Order Details].ProductID is NULL-able the query then becomes
The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. To verify this, see the extra anti semi join and row count spool that are added to the plan.
If Products.ProductID is also changed to become NULL-able the query then becomes
The reason for that one is that a NULL Products.ProductId should not be returned in the results unless the NOT IN sub query returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increases from around 400 to 500,000.
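The NULL semantics described above can be seen with a small self-contained script. This is only a sketch: the table variables and their rows are made up for illustration (they are not the Northwind data), and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: two products, and order lines for product 1
-- plus one row whose ProductID is NULL.
DECLARE @Products TABLE (ProductID int NOT NULL, ProductName varchar(40) NOT NULL);
DECLARE @OrderDetails TABLE (ProductID int NULL);

INSERT INTO @Products VALUES (1, 'Chai'), (2, 'Chang');
INSERT INTO @OrderDetails VALUES (1), (NULL);

-- NOT IN: "2 NOT IN (1, NULL)" evaluates to UNKNOWN, not TRUE,
-- so this returns no rows at all.
SELECT ProductID, ProductName
FROM @Products p
WHERE p.ProductID NOT IN (SELECT ProductID FROM @OrderDetails);

-- NOT EXISTS: the NULL row simply never matches the correlation predicate,
-- so this returns product 2 ('Chang').
SELECT ProductID, ProductName
FROM @Products p
WHERE NOT EXISTS (SELECT 1
                  FROM @OrderDetails od
                  WHERE od.ProductID = p.ProductID);

-- NOT IN with the NULLs filtered out of the subquery behaves like NOT EXISTS
-- here and also returns product 2.
SELECT ProductID, ProductName
FROM @Products p
WHERE p.ProductID NOT IN (SELECT ProductID
                          FROM @OrderDetails
                          WHERE ProductID IS NOT NULL);
```

Removing the NULL row from @OrderDetails makes all three queries return the same result, which is why the difference is easy to miss until a nullable column actually contains a NULL.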
Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there are no NULL rows in the data, the rest of the execution plan may be catastrophically worse if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree, for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column, however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column, or the NOT EXISTS against either a nullable or non-nullable column, it gives the following plan.
When the column changes to NULL-able, the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join, if that one returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.
I have a table which has about 120,000 records and need to select only those which do not exist (matched on a varchar column) in four other tables with approximately 1500, 4000, 40000 and 200 rows. All the involved tables have a unique index on the varchar column concerned. NOT IN took about 10 mins, NOT EXISTS took 4 secs.
I have a recursive query which might have had some untuned section that contributed to the 10 mins, but the other option taking 4 secs shows, at least to me, that NOT EXISTS is far better, or at least that IN and EXISTS are not exactly the same and that it is always worth a check before going ahead with code.
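One way to do that check on SQL Server is to compare logical reads and elapsed time for both forms in the same session. The sketch below reuses the two Northwind queries from the question; SET STATISTICS IO and SET STATISTICS TIME are session-level settings, and their output appears alongside the query results (in the Messages tab in Management Studio).

```sql
-- Sketch: report I/O and timing for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Form 1: NOT EXISTS
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE NOT EXISTS (
    SELECT 1
    FROM Northwind..[Order Details] od
    WHERE p.ProductID = od.ProductID);

-- Form 2: NOT IN
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE p.ProductID NOT IN (
    SELECT ProductID
    FROM Northwind..[Order Details]);

-- Switch the reporting off again for the rest of the session.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing the logical reads per table is usually more stable than comparing elapsed times, which vary with caching and server load.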