udf vs direct sql performance

Using MSSQL 2005

I was playing around today with a Scalar UDF in a where statement to see some of the costs associated with making the call and io differences etc.

I'm starting with 2 basic tables. Customer which has 1 million rows. and Purchases which has 100,000. Both have an auto identity column as Primary key. No other indexes defined.

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SET STATISTICS IO ON
    SELECT * FROM Customer C 
    INNER JOIN Purchases P on C.[IDENTITY] = P.CustomerID
    WHERE P.Amount > 1000
SET STATISTICS IO OFF

This returns IO statistics of

Table 'Customer'. Scan count 0, logical reads 3295, physical reads 1, read-ahead reads 32, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Purchases'. Scan count 1, logical reads 373, physical reads 1, read-ahead reads 370, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So just to see the impact of a scalar UDF I then just moved the P.Amount > 1000 to a UDF. Function is as follows:

CREATE FUNCTION [dbo].[HighValuePurchase]
(
    @value int
)
RETURNS bit
AS
BEGIN
    DECLARE @highValue bit
    SET @highValue = '0'

    IF @value > 1000
    BEGIN
        SET @highValue = '1'
    END
    RETURN @highValue
END

So I then ran the following query:

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SET STATISTICS IO ON      
    SELECT * FROM Customer C 
    INNER JOIN Purchases P on C.[IDENTITY] = P.CustomerID
    WHERE dbo.HighValuePurchase(P.Amount) = '1'
SET STATISTICS IO OFF

I was expecting this to run worse. This query returned the following IO statistics:

Table 'Purchases'. Scan count 1, logical reads 373, physical reads 1, read-ahead reads 370, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customer'. Scan count 1, logical reads 35, physical reads 3, read-ahead reads 472, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

This also returned faster than the > 1000 query. While the same rows were returned the ordering of the one calling the UDF was automatically sorted by C.[IDENTITY] where the other query appeared unsorted. This is likely due to the way the combines were done in the execution plans. Outline of the plans is below.

Execution plan for the non UDF shows a Clustered Index scan for Purchases and a Clustered Index seek for Customers combined at a nested join.

Execution plan for the UDF version shows a clustered index scan for purchases, then a filter, then a sort. There's a clustered Index scan on Customer. Then the results are combined in a Merge Join.

I'm sure this has to do with lack of indexes etc, but I'm unsure why these results are the way they are. I've experienced UDF's running painfully slow and everyone says using them is usually a bad idea, which is why I threw this test together. I can't explain currently why the UDF version seems to be so much better.

If you want to join on Purchases.CustomerID you should put an index on it.
If you often query on value ranges you should put an index on that too.

As it is you are asking SQL server to choose between two bad plans.

SQL Server can guess roughly how many purchases will be covered by the > 1000 query, and will pick a plan based on that.

However it can't guess how many will be covered by the UDF query, so may pick a different plan. Because it is going on ignorance it might be better or worse than the other plan depending on how good it's guess is.

You can see the plans generated and it will tell you the estimated number of rows in each plan, and also the actual number. Those estimated numbers account for the choice of plan in each case.