可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a question related to T-SQL and SQL Server.
Let's say I have a table Orders with 2 columns:
- ProductId int
- CustomerId int
- Date datetime
I want the date of the first order for every product, so I perform this type of query:
SELECT ProductId, MIN(Date) AS FirstOrder
FROM Orders
GROUP BY ProductId
I have an index on ProductId
, including the columns CustomerId
and Date
to speed up the query (IX_Orders
). The query plan looks like a non-clustered index scan on IX_Orders
, followed by a stream aggregate (no sort thanks to the index).
Now my problem is that I also want to retrieve the CustomerId
associated with the first order for each product (Product 26 was first ordered on Tuesday 25, by customer 12). The tricky part is that I don't want any inner loop in the execution plan, because it would mean an additional read per ProductId
in the table, which is highly inefficient.
This should just be possible using the same non-clustered index scan, followed by stream aggregates, however I can't seem to find a query that would do that. Any idea?
Thanks
回答1:
this will handle products that have duplicate dates:
DECLARE @Orders table (ProductId int
,CustomerId int
,Date datetime
)
INSERT INTO @Orders VALUES (1,1,'20090701')
INSERT INTO @Orders VALUES (2,1,'20090703')
INSERT INTO @Orders VALUES (3,1,'20090702')
INSERT INTO @Orders VALUES (1,2,'20090704')
INSERT INTO @Orders VALUES (4,2,'20090701')
INSERT INTO @Orders VALUES (1,3,'20090706')
INSERT INTO @Orders VALUES (2,3,'20090704')
INSERT INTO @Orders VALUES (4,3,'20090702')
INSERT INTO @Orders VALUES (5,5,'20090703') --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,1,'20090703') --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,5,'20090703') --duplicate dates for product #5
;WITH MinOrders AS
(SELECT
o.ProductId, o.CustomerId, o.Date
,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
FROM @Orders o
INNER JOIN (SELECT
ProductId
,MIN(Date) MinDate
FROM @Orders
GROUP BY ProductId
) dt ON o.ProductId=dt.ProductId AND o.Date=dt.MinDate
)
SELECT
m.ProductId, m.CustomerId, m.Date
FROM MinOrders m
WHERE m.RankValue=1
ORDER BY m.ProductId, m.CustomerId
this will return the same results, just use the same declare and inserts as the above code:
;WITH MinOrders AS
(SELECT
o.ProductId, o.CustomerId, o.Date
,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
FROM @Orders o
)
SELECT
m.ProductId, m.CustomerId, m.Date
FROM MinOrders m
WHERE m.RankValue=1
ORDER BY m.ProductId, m.CustomerId
You can try out each version to see which will run faster...
回答2:
declare @Orders table (
ProductId int,
CustomerId int,
Date datetime
)
insert into @Orders values (1,1,'20090701')
insert into @Orders values (2,1,'20090703')
insert into @Orders values (3,1,'20090702')
insert into @Orders values (1,2,'20090704')
insert into @Orders values (4,2,'20090701')
insert into @Orders values (1,3,'20090706')
insert into @Orders values (2,3,'20090704')
insert into @Orders values (4,3,'20090702')
insert into @Orders values (5,5,'20090703')
select O.* from @Orders O inner join
(
select ProductId,
MIN(Date) MinDate
from @Orders
group by ProductId
) FO
on FO.ProductId = O.ProductId and FO.MinDate = O.Date
The estimated query plan for this is useless as I'm mocking it with table variables, but the anonymous inner join should be optimised over a subselect.
回答3:
In SQL Server 2005+
:
SELECT oo.*
FROM (
SELECT DISTINCT ProductId
FROM Orders
) od
CROSS APPLY
(
SELECT TOP 1 ProductID, Date, CustomerID
FROM Orders oi
WHERE oi.ProductID = od.ProductID
ORDER BY
Date DESC
) oo
Nominally, the plan for the query contains Nested Loops
.
However, the outer loop will use a Index Scan
with Stream Aggregate
, and the inner loop will contain an Index Seek
for the ProductID
with a Top
.
In fact, the second operation is almost free, since the index page used in the inner loop will most probably reside in the cache because it had just been used for the outer loop.
Here's the test result on 1,000,000
rows (with 100
DISTINCT
ProductID
's):
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
(строк обработано: 100)
Table 'Orders'. Scan count 103, logical reads 6020, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 234 ms, elapsed time = 125 ms.
, while this is a result of a mere SELECT DISTINCT
query:
SELECT od.*
FROM (
SELECT DISTINCT ProductId
FROM Orders
) od
And the statistics:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
(строк обработано: 100)
Table 'Orders'. Scan count 3, logical reads 5648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 250 ms, elapsed time = 125 ms.
As we can see, the performance is the same, and the CROSS APPLY
takes just 400
extra logical reads
(which most probably will never be physical
).
Don't see how it's possible to improve this query anymore.
Also the benefit of this query is that it parallelizes nicely. You may notice that CPU
time is twice the elapsed time
: it's due to parallelization on my old Core Duo
.
A 4-core
CPU
would complete this query in half of that time.
Solutions using window functions do not parallelize:
SELECT od.*
FROM (
SELECT ProductId, Date, CustomerID, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Date DESC) AS rn
FROM Orders
) od
WHERE rn = 1
, and here are the statistics:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(строк обработано: 100)
Table 'Orders'. Scan count 1, logical reads 5123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 406 ms, elapsed time = 415 ms.
回答4:
I do not see a way of doing this nicely without doing a subquery or a windowing function (such as row_number, rank) as the max only looks in one column.
However you can do it not nicely.
SELECT
productid,
min(date),
cast(
substring(
min(convert(varchar(23),date,21) + cast(customerid as varchar(20)))
, 24, 44)
as int) customerid
from
orders
group by
productid
This only works if your customer id has less then 20 digits.
EDIT:
group by clause added
回答5:
SELECT
o1.productid,
o1.date,
o1.customerid
FROM
Orders o1
JOIN
(select productid, min(date) as orderDate
from Orders
group by productid
) firstOrder
ON o1.productid = firstOrder.productid
This is the best I can come up with though to be honest, I don't know what the performance characteristics of this query are. If it's no good, I'd probably suggest running two queries to get the information you want.
回答6:
Is IX_Orders sorted by ProductId, then CutomerId, then Date or is it ProductId, then Date, then CustomerId? If it is the former change it to the latter.
In other words don't use this:
create index IX_Orders on Orders (ProductId, CustomerId, Date)
Use this instead:
create index IX_Orders on Orders (ProductId, Date, CustomerId)
Then if you do:
SELECT o1.*
FROM [Order] o1
JOIN
(
SELECT ProductID, Min(Date) as Date
FROM [Order]
GROUP BY ProductID
) o2
ON o1.ProductID = o2.ProductID AND o1.Date = o2.Date
ORDER BY ProductID
You end up with just one index scan on IX_Orders however if two customers can order the same product at the same time you could get multiple rows for each product. You can get past this by using the following query, but it is less efficient than the first:
WITH cte AS
(
SELECT ProductID, CustomerID, Date,
ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY Date ASC) AS row
FROM [Order]
)
SELECT ProductID, CustomerId, Date
FROM cte
WHERE row = 1
ORDER BY ProductID