Why is a UDF so much slower than a subquery?

2019-01-06 19:42发布

问题:

I have a case where I need to translate (lookup) several values from the same table. The first way I wrote it, was using subqueries:

SELECT
    (SELECT id FROM user WHERE user_pk = created_by) AS creator,
    (SELECT id FROM user WHERE user_pk = updated_by) AS updater,
    (SELECT id FROM user WHERE user_pk = owned_by) AS owner,
    [name]
FROM asset

As I'm using this subquery a lot (that is, I have about 50 tables with these fields), and I might need to add some more code to the subquery (for example, "AND active = 1" ) I thought I'd put these into a user-defined function UDF and use that. But the performance using that UDF was abysmal.

CREATE FUNCTION dbo.get_user ( @user_pk INT )
RETURNS INT
AS BEGIN 
    RETURN ( SELECT id
             FROM   ice.dbo.[user]
             WHERE  user_pk = @user_pk )
END

SELECT dbo.get_user(created_by) as creator, [name]
FROM asset

The performance of #1 is less than 1 second. Performance of #2 is about 30 seconds...

Why, or more importantly, is there any way I can code in SQL server 2008, so that I don't have to use so many subqueries?

Edit:

Just a litte more explanation of when this is useful. This simple query (that is, get userid) gets a lot more complex when I want to have a text for a user, since I have to join with profile to get the language, with a company to see if the language should be fetch'ed from there instead, and with the translation table to get the translated text. And for most of these queries, performance is a secondary issue to readability and maintainability.

回答1:

The UDF is a black box to the query optimiser so it's executed for every row. You are doing a row-by-row cursor. For each row in an asset, look up an id three times in another table. This happens when you use scalar or multi-statement UDFs (In-line UDFs are simply macros that expand into the outer query)

One of many articles on the problem is "Scalar functions, inlining, and performance: An entertaining title for a boring post".

The sub-queries can be optimised to correlate and avoid the row-by-row operations.

What you really want is this:

select 
   uc.id as creator,
   uu.id as updater,
   uo.id as owner,
   a.[name]
from
    asset a
    JOIN
    user uc ON uc.user_pk = a.created_by
    JOIN
    user uu ON uu.user_pk = a.updated_by
    JOIN
    user uo ON uo.user_pk = a.owned_by


回答2:

As other posters have suggested, using joins will definitely give you the best overall performance.

However, since you've stated that that you don't want the headache of maintaining 50-ish similar joins or subqueries, try using an inline table-valued function as follows:

CREATE FUNCTION dbo.get_user_inline (@user_pk INT)
RETURNS TABLE AS
RETURN
(
    SELECT TOP 1 id
    FROM ice.dbo.[user]
    WHERE user_pk = @user_pk
        -- AND active = 1
)

Your original query would then become something like:

SELECT
    (SELECT TOP 1 id FROM dbo.get_user_inline(created_by)) AS creator,
    (SELECT TOP 1 id FROM dbo.get_user_inline(updated_by)) AS updater,
    (SELECT TOP 1 id FROM dbo.get_user_inline(owned_by)) AS owner,
    [name]
FROM asset

An inline table-valued function should have better performance than either a scalar function or a multistatement table-valued function.

The performance should be roughly equivalent to your original query, but any future changes can be made in the UDF, making it much more maintainable.



回答3:

To get the same result (NULL if user is deleted or not active).

 select 
    u1.id as creator,
    u2.id as updater,
    u3.id as owner,
    [a.name]
 FROM asset a
        LEFT JOIN user u1 ON (u1.user_pk = a.created_by AND u1.active=1) 
        LEFT JOIN user u2 ON (u2.user_pk = a.created_by AND u2.active=1) 
        LEFT JOIN user u3 ON (u3.user_pk = a.created_by AND u3.active=1) 


回答4:

Am I missing something? Why can't this work? You are only selecting the id which you already have in the table:

select created_by as creator, updated_by as updater, 
owned_by as owner, [name]
from asset

By the way, in designing you really should avoid keywords, like name, as field names.