We ran into a strange situation on SQL Server 2008 (SP1) - 10.0.2531.0 (X64) - Win2008 SP2 (X64).
Here is one heavy query:
select t1.id, t2.id
from t1, t2
where
t1.id = t2.ext_id
and isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null')
and isnull(t1.vchCol2, 'Null') = isnull(t2.vchCol2, 'Null')
-- ... and about 10 more comparisons with ISNULL
UPD: All compared columns (except the IDs) are varchar, with lengths from ~30 to ~200.
T1 has ~130 million rows, T2 has ~300k rows.
This query runs for ~5 hours on a rather big dev server - slow, but what can we do?
While investigating possible optimisations, we found that changing "isnull" to "coalesce" in the query above roughly doubles performance - the query now runs in ~2 hours.
UPD: When we remove all the ISNULL checks and use just t1.vchCol1 = t2.vchCol1, the query finishes in 40 minutes.
The question is: is this known behavior, and should we avoid using ISNULL everywhere?
I wonder if you'd see an improvement by splitting the cases out explicitly:
...
AND ((t1.vchCol1 = t2.vchCol1) OR (t1.vchCol1 IS NULL AND t2.vchCol1 IS NULL))
AND ((t1.vchCol2 = t2.vchCol2) OR (t1.vchCol2 IS NULL AND t2.vchCol2 IS NULL))
...
Most of the articles you'll find on this subject seem to contradict this: ISNULL is (marginally) faster than COALESCE.
Differences between ISNULL and COALESCE:

COALESCE basically translates to a CASE expression and ISNULL is a built-in implemented in the database engine.
...
This will make a performance difference and queries with COALESCE often fare worse here.
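To make the quoted point concrete: the SQL Server documentation describes COALESCE as being evaluated like a CASE expression, so (as a sketch, using one column from the question) the two predicates expand differently:

-- coalesce(t1.vchCol1, 'Null') is evaluated as the CASE expression:
CASE WHEN t1.vchCol1 IS NOT NULL THEN t1.vchCol1 ELSE 'Null' END
-- isnull(t1.vchCol1, 'Null') remains a single built-in call whose
-- return type is the type of the first argument (vchCol1).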
ISNULL vs. COALESCE:
I ran these tests several times on a few different servers, and ISNULL appears to pretty consistently out-perform COALESCE by an average of 10 or 12 percent. But that's the difference between 6 seconds and 5.3 seconds (the approximate average runtimes per test on my servers), over the course of a million executions. Hardly worth the functionality and standards compliance sacrifice, at least in the scenarios I use these functions for.
COALESCE vs ISNULL vs IS NULL OR:

The best performer is the IS NULL OR case, while the difference between all three of them is minor.
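If you want to verify this on your own data, a minimal timing sketch (assuming the tables and columns from the question) could compare the three predicate forms on a single column:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Variant 1: ISNULL with a sentinel value
select count(*)
from t1 inner join t2 on t1.id = t2.ext_id
where isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null');

-- Variant 2: COALESCE with a sentinel value
select count(*)
from t1 inner join t2 on t1.id = t2.ext_id
where coalesce(t1.vchCol1, 'Null') = coalesce(t2.vchCol1, 'Null');

-- Variant 3: explicit IS NULL handling
select count(*)
from t1 inner join t2 on t1.id = t2.ext_id
where (t1.vchCol1 = t2.vchCol1
       or (t1.vchCol1 is null and t2.vchCol1 is null));

Note that the sentinel forms would also match a stored literal 'Null' string against a real NULL, so the counts can differ in edge cases.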
You may want to consider adding a computed column to each table that holds a checksum value. Then, create an index on the ID column and the checksum value, and finally use the checksum value in the join. Something like this:
Alter Table T1 Add CheckSumId As CHECKSUM(vchCol1, vchCol2, vchCol3)
Alter Table T2 Add CheckSumId As CHECKSUM(vchCol1, vchCol2, vchCol3)
Create NonClustered index idx_T1_Checksum On T1(id, CheckSumId)
Create NonClustered index idx_T2_Checksum On T2(ext_id, CheckSumId)
Then your query would become...
select t1.id, t2.id
from t1 Inner Join t2
On t1.id = t2.ext_id
And T1.CheckSumId = T2.CheckSumId
where isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null')
and isnull(t1.vchCol2, 'Null') = isnull(t2.vchCol2, 'Null')
This will, of course, use extra index space, but it's just two integers, which should be very efficient. There will also be a performance penalty on each insert, update and delete, because another index needs to be maintained. Note that the ISNULL comparisons stay in the WHERE clause because CHECKSUM can produce collisions; the checksum join is only a cheap pre-filter. Still, I suspect this will have a large positive impact on query performance.
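As a sanity check on how selective the checksum pre-filter is, you could count (a hypothetical query, reusing the names above) how many checksum matches survive the full comparison:

select count(*) as checksum_matches,
       sum(case when isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null')
                 and isnull(t1.vchCol2, 'Null') = isnull(t2.vchCol2, 'Null')
                then 1 else 0 end) as full_matches
from t1 inner join t2
  on t1.id = t2.ext_id
 and t1.CheckSumId = t2.CheckSumId;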
I realize this is a year later, but...
For this kind of column-by-column comparison, you might consider using EXCEPT. Also, EXCEPT treats NULL like another value instead of "It could be anything!", as I like to call it.
"When you compare rows for determining distinct values, two NULL values are considered equal."
--from http://msdn.microsoft.com/en-us/library/ms188055.aspx
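A sketch of how that could look for the query in the question (assuming the same table and column names): the correlated NOT EXISTS ... EXCEPT pattern compares all columns at once and treats two NULLs as equal, so no ISNULL wrappers are needed:

select t1.id, t2.id
from t1 inner join t2 on t1.id = t2.ext_id
where not exists
(
    select t1.vchCol1, t1.vchCol2  -- ... and the other ~10 columns
    except
    select t2.vchCol1, t2.vchCol2  -- ... same columns from t2
);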