I'm looking for the official T-SQL documentation for "ORDER BY RAND()" and "ORDER BY NEWID()". There are numerous articles describing them, so they must be documented somewhere.
I'm looking for a link to an official SQL Server documentation page like this: http://technet.microsoft.com/en-us/library/ms188385.aspx
CLARIFICATION:
What I'm looking for is the documentation for "order_by_expression" that explains the difference in behavior between a nonnegative integer constant, a function that returns a nonnegative integer, and a function that returns any other value (like RAND() or NEWID()).
ANSWER:
I apologize for the lack of clarity in my original question. As with most programming-related problems, solving it is primarily a matter of figuring out what question you're actually trying to answer.
Thank you everyone.
The answer is in this document, from http://www.wiscorp.com/sql200n.zip:
Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation)
22.2 <direct select statement: multiple rows> includes a <cursor specification>.
At this point we have the first half of the answer:
A SELECT statement is a type of CURSOR, which means that operations can be performed iteratively on each row. Although I haven't found a statement in the docs that explicitly says it, I'm content to assume that the expression in the order_by_expression will be executed for each row.
Now it makes sense what is happening when you use RAND() or NEWID() or CEILING(RAND() + .5) / 2 as opposed to a numeric constant or a column name.
The expression will never be treated like a column number. It will always be a value that is generated for each row which will be used as the basis for determining the order of the rows.
However, for thoroughness, let's continue to the full definition of what an expression can be.
14.3 <cursor specification> includes ORDER BY <sort specification list>.
10.10 <sort specification list> defines:
<sort specification> ::= <sort key> [ <ordering specification> ] [ <null ordering> ]
<sort key> ::= <value expression>
<ordering specification> ::= ASC | DESC
<null ordering> ::= NULLS FIRST | NULLS LAST
Which takes us to:
6.25 <value expression>
Where we find the second half of the answer:
<value expression> ::=
<common value expression>
| <boolean value expression>
| <row value expression>
<common value expression> ::=
<numeric value expression>
| <string value expression>
| <datetime value expression>
| <interval value expression>
| <user-defined type value expression>
| <reference value expression>
| <collection value expression>
<user-defined type value expression> ::= <value expression primary>
<reference value expression> ::= <value expression primary>
<collection value expression> ::= <array value expression> | <multiset value expression>
From here we descend into the numerous possible types of expressions that can be used.
NEWID() returns a uniqueidentifier.
It seems reasonable to assume that uniqueidentifiers are compared numerically, so if expression is NEWID() our <common value expression> will be a <numeric value expression>.
Similarly, RAND() returns a numeric value, and it will also be evaluated as a <numeric value expression>.
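The return types mentioned above can be checked directly. This is a minimal sketch using SQL_VARIANT_PROPERTY; the column aliases are made up for illustration.

```sql
-- Report the base data type each function returns.
-- The aliases newid_type and rand_type are hypothetical names.
SELECT
    SQL_VARIANT_PROPERTY(NEWID(), 'BaseType') AS newid_type,  -- uniqueidentifier
    SQL_VARIANT_PROPERTY(RAND(),  'BaseType') AS rand_type;   -- float
```

Note that this only shows the T-SQL data types; how they map onto the standard's expression categories remains the assumption stated above.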
So, although I wasn't able to find anything in Microsoft's official documentation that explains what ORDER BY does when called using an order_by_expression that is an expression, it really is documented, as I knew it must be.
If we're being sticklers for details, the question you asked was essentially "Where's the docs for ~". The answer is nowhere; there is no doc like the one you're looking for.
Not a single one, anyway. There are multiple docs that treat NEWID(), RAND() and ORDER BY separately, and you have to put the pieces together yourself.
Basically,
This lets you know it's valid syntax, but there's no single link for you to point to.
Check out the links below.
ORDER BY, RAND, and NEWID are a clause and two functions that are part of the T-SQL language.
Combining them to randomly select or generate data is a design pattern.
See the first two articles.
Generate random integers without collisions
MSDN - Selecting Rows Randomly from a Large Table
MSDN - RAND
MSDN - NEWID
MSDN - ORDER BY
Very good read, Aaron.
But again, taken separately, RAND, NEWID, and ORDER BY are elements of the T-SQL language.
Using them to randomly choose data is a design pattern.
Also, you can call RAND() in a WHILE loop (RBAR, row by agonizing row) to produce random numbers.
That is because each iteration is a separate statement, so RAND() is no longer a single constant in the query plan.
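A minimal sketch of the loop approach, assuming a hypothetical table variable named @samples that exists only to collect the generated values:

```sql
-- @samples and @i are hypothetical names used only for this illustration.
DECLARE @samples TABLE (n int NOT NULL, r float NOT NULL);
DECLARE @i int = 1;

WHILE @i <= 5
BEGIN
    -- Each INSERT is its own statement, so RAND() is evaluated again here.
    INSERT INTO @samples (n, r) VALUES (@i, RAND());
    SET @i += 1;
END;

-- The r column should hold a different value on each row.
SELECT n, r FROM @samples ORDER BY n;
```

This trades set-based processing for one statement per row, so it is only reasonable for small row counts.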
If you're trying to determine why these behave differently, the reason is simple: one is evaluated once, and treated as a runtime constant (RAND()), while the other is evaluated for every single row (NEWID()). Observe this simple example:
Results:
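A minimal sketch of this kind of comparison (not necessarily the exact query used in this answer), assuming sys.all_objects as a convenient multi-row source and made-up column aliases:

```sql
-- Left column: RAND(), evaluated once for the whole statement.
-- Right column: NEWID(), evaluated for every row.
-- sys.all_objects is only an assumed row source; any table would do.
SELECT TOP (5)
    RAND()  AS rand_value,
    NEWID() AS newid_value
FROM sys.all_objects;
```

Run against any multi-row source, rand_value repeats the same number on every row while newid_value differs on each row.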
Now, if you apply an order by to the left column, SQL Server says, ok, but every single value is the same, so I'm basically just going to ignore your request and move on to the next ORDER BY column. If there isn't one, then SQL Server will default to returning the rows in whatever order it deems most efficient.
If you apply an order by to the right column, now SQL Server actually has to sort all of the values. This introduces a Sort (or a TopN Sort if TOP is used) operator into the plan, and is likely going to take more CPU (though overall duration may not be substantially affected, depending on the size of the set and other factors).
Let's compare the plans for these two queries:
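A sketch of two such queries, one ordered by RAND() and one by NEWID(). The temp tables #parent and #child and their contents are hypothetical stand-ins (assumptions, not the tables used in this answer); run each SELECT with the actual execution plan enabled to compare them.

```sql
-- Hypothetical heaps so that both inputs are read with plain scans.
CREATE TABLE #parent (id int NOT NULL);
CREATE TABLE #child  (id int NOT NULL, parent_id int NOT NULL);

INSERT INTO #parent (id) VALUES (1), (2), (3);
INSERT INTO #child  (id, parent_id) VALUES (10, 1), (11, 1), (12, 2), (13, 3);

-- Query 1: ordering by a runtime constant.
SELECT p.id, c.id AS child_id
FROM #parent AS p
INNER JOIN #child AS c ON c.parent_id = p.id
ORDER BY RAND();

-- Query 2: ordering by a value generated per row.
SELECT p.id, c.id AS child_id
FROM #parent AS p
INNER JOIN #child AS c ON c.parent_id = p.id
ORDER BY NEWID();

DROP TABLE #child;
DROP TABLE #parent;
```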
The plan:
There is no sort operator going on, and both of the scans are Ordered = False - this means that SQL Server has not decided to explicitly implement any ordering, but this certainly does not mean that the order will be any different on each execution - it just means that the order is non-deterministic (unless you add a secondary ORDER BY - but even in that case, the RAND() ordering is still ignored because, well, it's the same value on every row).
And now NEWID():
The plan:
There is a new Sort operator there, which means that SQL Server must reorder all the rows to be returned in the order of the generated GUID values on each row. The scans of course are still unordered, but the Sort ultimately applies the order.
I don't know that this specific implementation detail is officially documented anywhere, though I did find this article which includes an explicit ORDER BY NEWID(). I doubt you'll find anything official that documents ORDER BY RAND() in any way, because that just doesn't make any sense to do, officially supported or not.
Re: the comment that SQL Server assigns "a seed value at random" - this should not be interpreted as "a seed value **per row** at random". Demonstration:
Results:
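A sketch of a demonstration along these lines, assuming sys.all_columns cross-joined with itself as a row source and an arbitrary row count (both are assumptions, not the original query):

```sql
-- If RAND() were seeded per row, MIN and MAX would differ.
-- The TOP value is an arbitrary, hypothetical row count; raise it to taste.
SELECT MIN(x.r) AS min_r, MAX(x.r) AS max_r
FROM
(
    SELECT TOP (1000000) RAND() AS r
    FROM sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
) AS x;
```

Because RAND() is evaluated once for the statement, min_r and max_r come back equal regardless of how many rows are generated.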
On my machine, this took about 15 seconds to run, and the results were always the same for both MIN and MAX. Keep increasing the number of rows returned and the amount of time it takes, and I guarantee you will continue to see the exact same value for RAND() on every row. It is calculated exactly once, and that is not because SQL Server is wise to the fact that I am not returning all of the rows. This also yielded the same result (and it took just under 2 minutes to populate the entire table with 72 million rows):
(In fact the SELECT took almost as long as the initial population. Do not try this on a single-core laptop with 4GB of RAM.)
The result: