Where is the official documentation for T-SQL'

2019-09-21 03:54发布

I'm looking for the official T-SQL documentation for "ORDER BY RAND()" and "ORDER BY NEWID()". There are numerous articles describing them, so they must be documented somewhere.

I'm looking for a link to an official SQL Server documentation page like this: http://technet.microsoft.com/en-us/library/ms188385.aspx

CLARIFICATION:

What I'm looking for is the documentation for "order_by_expression" that explains the difference in behavior between a nonnegative integer constant, a function that returns a nonnegative integer, and a function that returns any other value (like RAND() or NEWID()).


ANSWER:

I appologize for the lack of clarity in my original question. As with most programming-related problems, the solution to the problem is primarily figuring out what question you're actually trying to answer.

Thank you everyone.


The answer is in this document: From: http://www.wiscorp.com/sql200n.zip

Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation)

22.2 <direct select statement: multiple rows> includes a <cursor specification>.

At this point we have the first half of the answer:

A SELECT statment is a type of CURSOR, which means that operations can be performed iteratively on each row. Although I haven't found a statement in the docs that explicity says it, I'm content to assume that the expression in the order_by_expression will be executed for each row.

Now it makes sense what is happening when you use RAND() or NEWID() or CEILING(RAND() + .5) / 2 as opposed to a numeric constant or a column name.
The expression will never be treated like a column number. It will always be a value that is generated for each row which will be used as the basis for determining the order of the rows.

However, for thoroughness, let's continue to the full definition of what an expression can be.

14.3 <cursor specification> includes ORDER BY <sort specification list>.

10.10 <sort specification list> defines:

<sort specification> ::= <sort key> [ <ordering specification> ] [ <null ordering> ]
    <sort key> ::= <value expression>
    <ordering specification> ::= ASC | DESC
    <null ordering> ::= NULLS FIRST | NULLS LAST

Which takes us to:

6.25 <value expression>

Where we find the second half of the answer:

<value expression> ::= 
      <common value expression> 
    | <boolean value expression> 
    | <row value expression>

<common value expression> ::= 
      <numeric value expression> 
    | <string value expression>
    | <datetime value expression>
    | <interval value expression>
    | <user-defined type value expression>
    | <reference value expression>
    | <collection value expression>

    <user-defined type value expression> ::= <value expression primary>
    <reference value expression> ::= <value expression primary>
    <collection value expression> ::= <array value expression> | <multiset value expression>

From here we descend into the numerous possibile types of expressions that can be used.

NEWID() returns a uniqueidentifier.
It seems reasonable to assume that uniqueidentifiers are compared numerically, so if expression is NEWID() our <common value expression> will be a <numeric value expression>.

Similarly, RAND() returns a numeric value, and it will also be evaluated as a <numeric value expression>.

So, although I wasn't able to find anything in Microsoft's offical documentation that explains what ORDER BY does when called using an order_by_expression that is an expression, it really is documented, as I knew it must be.

3条回答
爱情/是我丢掉的垃圾
2楼-- · 2019-09-21 04:11

If we're being a stickler for details, the question you asked was essentially "Where's the docs for ~". The answer is nowhere, there is no doc like the one you're looking for.

Not a single one anyway, there are multiple docs that treat NEWID(), RAND() and ORDER BY separately and you have to put the pieces together yourself.

Basically,

This lets you know it's valid syntax, but there's no single link for you to point to.

查看更多
叛逆
3楼-- · 2019-09-21 04:25

Check out the links below.

ORDER BY, RAND and NEWID are statement and functions part of the TSQL language.

Combining them to randomly select or generate data is a design pattern.

See the first two articles.

Generate random integers without collisions

http://www.sqlperformance.com/2013/09/t-sql-queries/random-collisions

MSDN - Selecting Rows Randomly from a Large Table

http://msdn.microsoft.com/en-us/library/cc441928.aspx

MSDN - RAND

http://technet.microsoft.com/en-us/library/ms177610.aspx

MSDN - NEWID

http://msdn.microsoft.com/fr-fr/library/ms190348.aspx

MSDN - ORDER BY

http://technet.microsoft.com/en-us/library/ms188385.aspx

Very good read Aaron.

But again, taken separately (RAND, NEWID, ORDER BY) are elements part of a TSQL language.

Using them to randomly choose data is a design pattern.

Also, you can call RAND() in a while loop - RBAR() produce random numbers.

That is because in the query plan, RAND(), is no longer a constant.

-- RBAR solution
declare @x float = 0;
declare @y int = 0;
while (@y < 100)
begin
    set @x = rand();
    print @x;
    set @y += 1;
end;
go

enter image description here

查看更多
▲ chillily
4楼-- · 2019-09-21 04:29

If you're trying to determine why these behave differently, the reason is simple: one is evaluated once, and treated as a runtime constant (RAND()), while the other is evaluated for every single row (NEWID()). Observe this simple example:

SELECT TOP (5) RAND(), NEWID() FROM sys.objects;

Results:

0.240705716465209        8D5D2B55-E5DE-4FF9-BA84-BC82F37B8F3A
0.240705716465209        C4CBF1CA-E6D0-4076-B6A6-5048EA612048
0.240705716465209        9BFAE5BB-B5B9-47DE-B8F9-77AAEFA5F9DB
0.240705716465209        89FFD8A1-AC73-4CEB-A5C0-00A76D040382
0.240705716465209        BCC89923-735E-43B3-9ECA-622A8C98AD7D

Now, if you apply an order by to the left column, SQL Server says, ok, but every single value is the same, so I'm basically just to ignore your request and move on to the next ORDER BY column. If there isn't one, then SQL Server will default to returning the rows in whatever order it deems most efficient.

If you apply an order by to the right column, now SQL Server actually has to sort all of the values. This introduces a Sort (or a TopN Sort if TOP is used) operator into the plan, and is likely going to take more CPU (though overall duration may not be substantially affected, depending on the size of the set and other factors).

Let's compare the plans for these two queries:

SELECT RAND() FROM sys.all_columns ORDER BY RAND();

The plan:

enter image description here

There is no sort operator going on, and both of the scans are Ordered = False - this means that SQL Server has not decided to explicitly implement any ordering, but this certainly does not mean that the order will be any different on each execution - it just means that the order is non-deterministic (unless you add a secondary ORDER BY - but even in that case, the RAND() ordering is still ignored because, well, it's the same value on every row).

And now NEWID():

SELECT NEWID() FROM sys.all_columns ORDER BY NEWID();

The plan:

enter image description here

There is a new Sort operator there, which means that SQL Server must reorder all the rows to be returned in the order of the generated GUID values on each row. The scans of course are still unordered, but the Sort ultimately applies the order.

I don't know that this specific implementation detail is officially documented anywhere, though I did find this article which includes an explicit ORDER BY NEWID(). I doubt you'll find anything official that documents ORDER BY RAND() in any way, because that just doesn't make any sense to do, officially supported or not.

Re: the comment that SQL Server assigns a seed value at random - this should not be interpreted as a seed value **per row** at random. Demonstration:

SELECT MAX(r), MIN(r) FROM 
(
  SELECT RAND() FROM sys.all_columns AS s1 
  CROSS JOIN sys.all_columns AS s2
) AS x(r);

Results:

0.4866202638872        0.4866202638872

On my machine, this took about 15 seconds to run, and the results were always the same for both MIN and MAX. Keep increasing the number of rows returned and the amount of time it takes, and I guarantee you will continue to see the exact same value for RAND() on every row. It is calculated exactly once, and that is not because SQL Server is wise to the fact that I am not returning all of the rows. This also yielded the same result (and it took just under 2 minutes to populate the entire table with 72 million rows):

SELECT RAND() AS r INTO #x 
      FROM sys.all_columns AS s1 
CROSS JOIN sys.all_columns AS s2
CROSS JOIN sys.all_columns AS s3;

SELECT MAX(r), MIN(r) FROM #x;

(In fact the SELECT took almost as long as the initial population. Do not try this on a single-core laptop with 4GB of RAM.)

The result:

0.302690214345828        0.302690214345828
查看更多
登录 后发表回答