Two radically different queries against 4 million records

Posted 2019-06-23 18:41

Question:

I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.

I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.

The first query uses a brute-force method: it evaluates all candidate matches and removes the incorrect ones via an aggregate summation.

The second gets all candidate matches, then removes the incorrect ones via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.

Logically, one would expect the brute-force query to be slow and the index-based one to be fast. Not so, and I have experimented heavily with indexes until I got the best speed out of each.

Further, the brute-force query doesn't require as many indexes, which means that technically it would yield better overall system performance.

Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.

Brute-force query:

SELECT      ProductID, [Rank]
FROM        (
            SELECT      p.ProductID, ptr.[Rank], SUM(CASE
                            WHEN p.ParamLo < si.LowMin OR
                            p.ParamHi > si.HiMax THEN 1
                            ELSE 0
                            END) AS Fail
            FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                        JOIN dbo.ProductDefs AS pd
            ON          pd.ParamTypeID = si.ParamTypeID
                        JOIN dbo.Params AS p
            ON          p.ProductDefID = pd.ProductDefID
                        JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
            ON          ptr.ProductTypeID = pd.ProductTypeID
            WHERE       si.Mode IN (1, 2)
            GROUP BY    p.ProductID, ptr.[Rank]
            ) AS t
WHERE       t.Fail = 0
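
For reference, the Fail = 0 filter can equivalently be written with HAVING, which avoids the derived table (SQL Server generally compiles both forms to the same plan, so this is purely a readability variant):

SELECT      p.ProductID, ptr.[Rank]
FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
            JOIN dbo.ProductDefs AS pd
ON          pd.ParamTypeID = si.ParamTypeID
            JOIN dbo.Params AS p
ON          p.ProductDefID = pd.ProductDefID
            JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON          ptr.ProductTypeID = pd.ProductTypeID
WHERE       si.Mode IN (1, 2)
GROUP BY    p.ProductID, ptr.[Rank]
HAVING      SUM(CASE WHEN p.ParamLo < si.LowMin OR
                          p.ParamHi > si.HiMax THEN 1 ELSE 0 END) = 0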

Index-based exception query:

with si AS (
    SELECT      DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
    FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                JOIN dbo.ProductDefs AS pd
    ON          pd.ParamTypeID = si.ParamTypeID
                JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
    ON          ptr.ProductTypeID = pd.ProductTypeID
    WHERE       si.Mode IN (1, 2)
)
SELECT      p.ProductID
FROM        dbo.Params AS p
            JOIN si
ON          si.ProductDefID = p.ProductDefID
EXCEPT
SELECT      p.ProductID
FROM        dbo.Params AS p
            JOIN si
ON          si.ProductDefID = p.ProductDefID    
WHERE       p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
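
The two dedicated indexes are along these lines (the definitions shown here are illustrative, keyed to the predicates in the query above; names and included columns are simplified):

-- Illustrative definitions of the low/high mismatch indexes on dbo.Params
CREATE NONCLUSTERED INDEX IX_Params_Low
    ON dbo.Params (ProductDefID, ParamLo)
    INCLUDE (ProductID);

CREATE NONCLUSTERED INDEX IX_Params_High
    ON dbo.Params (ProductDefID, ParamHi)
    INCLUDE (ProductID);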

My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.

EDIT:

I have updated the indexes, and now have the following execution plan for the second query:

Answer 1:

Trust the optimizer.

Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then look at whether there are any missing indexes. But you still shouldn't have to explicitly work with those indexes.

Don't concern yourself with how you might implement such a search.

In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
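
For reference, such a hint is just a table-level INDEX hint. IX_Params_1 below is the index name visible in your posted plans; the predicate is purely illustrative:

-- Forcing a specific nonclustered index; a last resort, because it pins the
-- plan to this index even after the data distribution changes.
SELECT p.ProductID
FROM   dbo.Params AS p WITH (INDEX(IX_Params_1))
WHERE  p.ProductDefID = 42;   -- illustrative predicate only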


In your posted plans, your "optimized" version is causing scans against two indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if the "brute force" version does a single scan against the table and the "optimized" one does two, it's easy to see why the second isn't more efficient.


I think I'd try:

SELECT      p.ProductID, ptr.[Rank]
FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
            JOIN dbo.ProductDefs AS pd
ON          pd.ParamTypeID = si.ParamTypeID
            JOIN dbo.Params AS p
ON          p.ProductDefID = pd.ProductDefID
            JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON          ptr.ProductTypeID = pd.ProductTypeID
            LEFT JOIN dbo.Params AS p_anti              -- added: anti-join
ON          p_anti.ProductDefID = pd.ProductDefID AND
            (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax)
WHERE       si.Mode IN (1, 2)
AND         p_anti.ProductID IS NULL                    -- added: keep only rows with no violating match
GROUP BY    p.ProductID, ptr.[Rank]

I.e. introduce an anti-join that eliminates the results you don't want.
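
The same elimination can also be written as NOT EXISTS, which SQL Server usually compiles to the same anti-semi-join; it's worth comparing both forms against your data:

-- Equivalent elimination expressed as NOT EXISTS instead of LEFT JOIN / IS NULL
SELECT      p.ProductID, ptr.[Rank]
FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
            JOIN dbo.ProductDefs AS pd
ON          pd.ParamTypeID = si.ParamTypeID
            JOIN dbo.Params AS p
ON          p.ProductDefID = pd.ProductDefID
            JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON          ptr.ProductTypeID = pd.ProductTypeID
WHERE       si.Mode IN (1, 2)
AND         NOT EXISTS (
                SELECT 1
                FROM   dbo.Params AS p_anti
                WHERE  p_anti.ProductDefID = pd.ProductDefID
                AND    (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax)
            )
GROUP BY    p.ProductID, ptr.[Rank]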



Answer 2:

In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. SSMS will determine the query plans for both and give you a 'percent of total batch' cost for each one. The query with the lower percent of the total batch will be the better-performing one.
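
A minimal skeleton for doing that comparison, with I/O and timing statistics switched on as a cross-check (paste the two queries where indicated):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- paste the brute-force query here
-- paste the EXCEPT-based query here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;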



Answer 3:

Does 6 seconds on a laptop equal 0.006 seconds on production hardware? The parts of your queries that worry me are the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.

What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist that data in the database and update the LowMin and HiMax as rows are added?
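
As a sketch of that idea (the #SearchItems name is illustrative; columns are taken from the queries above), you could materialize the function's output into an indexed temp table before joining, so the optimizer sees real row counts and statistics rather than guessing at the function's output:

-- Materialize the search criteria once, then join to the temp table
SELECT  si.ParamTypeID, si.LowMin, si.HiMax
INTO    #SearchItems
FROM    dbo.SearchItemsGet(@SearchID, NULL) AS si
WHERE   si.Mode IN (1, 2);

CREATE CLUSTERED INDEX IX_SearchItems ON #SearchItems (ParamTypeID);

-- ...then join to #SearchItems instead of calling the function directly.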

Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are: the wide lines mean there are many rows. We need to reduce the number of rows earlier in the process so we are not working with such large hash tables, large sorts, and nested loops.

BTW, how many rows does your source have, and how many rows end up in the result set?



Answer 4:

Thank you all for your input and help.

From reading what you wrote, experimenting, and digging into the execution plan, I discovered the answer is the tipping point.

There were too many records being returned to warrant use of the index.

See here (Kimberly Tripp).
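
For anyone checking the same thing: the rule of thumb is that a non-covering nonclustered index stops being worthwhile once the rows returned exceed roughly a quarter to a third of the table's page count. You can get the page count like this (standard DMV query; the 25-33% band is the published heuristic):

-- Page count of dbo.Params (heap or clustered index only); compare the rows a
-- search returns against roughly 25-33% of this number to see whether a
-- nonclustered index seek is even plausible.
SELECT  SUM(in_row_data_page_count) AS pages
FROM    sys.dm_db_partition_stats
WHERE   [object_id] = OBJECT_ID('dbo.Params')
AND     index_id IN (0, 1);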