Find the row associated with a Min/Max, without in

2019-04-20 01:51发布

问题:

I have a question related to T-SQL and SQL Server.

Let's say I have a table Orders with 2 columns:

  • ProductId int
  • CustomerId int
  • Date datetime

I want the date of the first order for every product, so I perform this type of query:

SELECT ProductId, MIN(Date) AS FirstOrder 
FROM Orders
GROUP BY ProductId

I have an index on ProductId, including the columns CustomerId and Date to speed up the query (IX_Orders). The query plan looks like a non-clustered index scan on IX_Orders, followed by a stream aggregate (no sort thanks to the index).

Now my problem is that I also want to retrieve the CustomerId associated with the first order for each product (Product 26 was first ordered on Tuesday 25, by customer 12). The tricky part is that I don't want any inner loop in the execution plan, because it would mean an additional read per ProductId in the table, which is highly inefficient.

This should just be possible using the same non-clustered index scan, followed by stream aggregates, however I can't seem to find a query that would do that. Any idea?

Thanks

回答1:

this will handle products that have duplicate dates:

DECLARE @Orders table (ProductId int
                      ,CustomerId int
                      ,Date datetime
                      )

INSERT INTO @Orders VALUES (1,1,'20090701')
INSERT INTO @Orders VALUES (2,1,'20090703')
INSERT INTO @Orders VALUES (3,1,'20090702')
INSERT INTO @Orders VALUES (1,2,'20090704')
INSERT INTO @Orders VALUES (4,2,'20090701')
INSERT INTO @Orders VALUES (1,3,'20090706')
INSERT INTO @Orders VALUES (2,3,'20090704')
INSERT INTO @Orders VALUES (4,3,'20090702')
INSERT INTO @Orders VALUES (5,5,'20090703')  --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,1,'20090703')  --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,5,'20090703')  --duplicate dates for product #5

;WITH MinOrders AS
(SELECT
     o.ProductId, o.CustomerId, o.Date
         ,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
     FROM @Orders o
     INNER JOIN (SELECT
                     ProductId
                         ,MIN(Date) MinDate 
                     FROM @Orders 
                     GROUP BY ProductId
                ) dt ON o.ProductId=dt.ProductId AND o.Date=dt.MinDate
 )
SELECT
    m.ProductId, m.CustomerId, m.Date
    FROM MinOrders  m
    WHERE m.RankValue=1
    ORDER BY m.ProductId, m.CustomerId

this will return the same results, just use the same declare and inserts as the above code:

;WITH MinOrders AS
(SELECT
     o.ProductId, o.CustomerId, o.Date
         ,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
     FROM @Orders o
 )
SELECT
    m.ProductId, m.CustomerId, m.Date
    FROM MinOrders  m
    WHERE m.RankValue=1
    ORDER BY m.ProductId, m.CustomerId

You can try out each version to see which will run faster...



回答2:

declare @Orders table (
    ProductId int,
    CustomerId int,
    Date datetime
)

insert into @Orders values (1,1,'20090701')
insert into @Orders values (2,1,'20090703')
insert into @Orders values (3,1,'20090702')
insert into @Orders values (1,2,'20090704')
insert into @Orders values (4,2,'20090701')
insert into @Orders values (1,3,'20090706')
insert into @Orders values (2,3,'20090704')
insert into @Orders values (4,3,'20090702')
insert into @Orders values (5,5,'20090703')

select O.* from @Orders O inner join 
(
    select ProductId,
    MIN(Date) MinDate 
    from @Orders 
    group by ProductId
) FO
on FO.ProductId = O.ProductId and FO.MinDate = O.Date

The estimated query plan for this is useless as I'm mocking it with table variables, but the anonymous inner join should be optimised over a subselect.



回答3:

In SQL Server 2005+:

SELECT  oo.*
FROM    (
        SELECT  DISTINCT ProductId
        FROM    Orders
        ) od
CROSS APPLY
        (
        SELECT  TOP 1 ProductID, Date, CustomerID
        FROM    Orders oi
        WHERE   oi.ProductID = od.ProductID
        ORDER BY
                Date DESC
        ) oo

Nominally, the plan for the query contains Nested Loops.

However, the outer loop will use a Index Scan with Stream Aggregate, and the inner loop will contain an Index Seek for the ProductID with a Top.

In fact, the second operation is almost free, since the index page used in the inner loop will most probably reside in the cache because it had just been used for the outer loop.

Here's the test result on 1,000,000 rows (with 100 DISTINCT ProductID's):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 103, logical reads 6020, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 234 ms,  elapsed time = 125 ms.

, while this is a result of a mere SELECT DISTINCT query:

SELECT  od.*
FROM    (
        SELECT  DISTINCT ProductId
        FROM    Orders
        ) od

And the statistics:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 3, logical reads 5648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 250 ms,  elapsed time = 125 ms.

As we can see, the performance is the same, and the CROSS APPLY takes just 400 extra logical reads (which most probably will never be physical).

Don't see how it's possible to improve this query anymore.

Also the benefit of this query is that it parallelizes nicely. You may notice that CPU time is twice the elapsed time: it's due to parallelization on my old Core Duo.

A 4-core CPU would complete this query in half of that time.

Solutions using window functions do not parallelize:

SELECT  od.*
FROM    (
        SELECT  ProductId, Date, CustomerID, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Date DESC) AS rn
        FROM    Orders
        ) od
WHERE   rn = 1

, and here are the statistics:

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 1, logical reads 5123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 406 ms,  elapsed time = 415 ms.


回答4:

I do not see a way of doing this nicely without doing a subquery or a windowing function (such as row_number, rank) as the max only looks in one column.

However you can do it not nicely.

SELECT
    productid, 
    min(date), 
cast(
    substring( 
        min(convert(varchar(23),date,21) + cast(customerid as varchar(20)))
              , 24, 44)
    as int) customerid
from 
    orders
group by
    productid 

This only works if your customer id has less then 20 digits.

EDIT: group by clause added



回答5:

SELECT
    o1.productid, 
    o1.date, 
    o1.customerid
FROM
    Orders o1
JOIN
    (select productid, min(date) as orderDate
     from Orders
     group by productid
    ) firstOrder
ON o1.productid = firstOrder.productid

This is the best I can come up with though to be honest, I don't know what the performance characteristics of this query are. If it's no good, I'd probably suggest running two queries to get the information you want.



回答6:

Is IX_Orders sorted by ProductId, then CutomerId, then Date or is it ProductId, then Date, then CustomerId? If it is the former change it to the latter.

In other words don't use this:

create index IX_Orders on Orders (ProductId, CustomerId, Date) 

Use this instead:

create index IX_Orders on Orders (ProductId, Date, CustomerId)

Then if you do:

SELECT o1.* 
FROM [Order] o1
JOIN
    (
        SELECT ProductID, Min(Date) as Date
        FROM [Order]
        GROUP BY ProductID
    ) o2
    ON o1.ProductID = o2.ProductID AND o1.Date = o2.Date
ORDER BY ProductID

You end up with just one index scan on IX_Orders however if two customers can order the same product at the same time you could get multiple rows for each product. You can get past this by using the following query, but it is less efficient than the first:

WITH cte AS
(
    SELECT ProductID, CustomerID, Date, 
        ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY Date ASC) AS row
    FROM [Order]
)
SELECT ProductID, CustomerId, Date
FROM cte
WHERE row = 1
ORDER BY ProductID