I've read about the ORDER BY RAND() and its performance problems -- do these only apply to queries that return large datasets? For example, if I have a table with 100,000 rows and return a dataset with 10 records using a WHERE clause and then use ORDER BY RAND() LIMIT 1, will this ORDER BY RAND() be applied AFTER my table has been filtered down to records matching the WHERE clause, and thus have negligible performance issues?
相关问题
- Faster loop: foreach vs some (performance of jsper
- sqlyog export query result as csv
- NOT DISTINCT query in mySQL
- Why wrapping a function into a lambda potentially
- MySQL: conduct a basic search
Based on a quick test, I have to conclude that ORDER BY RAND() is applied only after the WHERE statement is applied, and not to the whole dataset.
Results from a table with 50,000 rows:
It doesn't matter how many rows you select. If you
ORDER BY RAND()
a random number is calculated for every single row in the table. This is because it must calculate the random value for every row in order to know which row generated the largest value. So if you have a table with 100,000 rows and then callORDER BY RAND() LIMIT 1
You are telling MySQL to generate a random number for 100,000 rows, sort them by that number, and then give you the first one.It is much much faster to:
SELECT COUNT(*) FROM
Table
Generate random number between 0 and result of above query minus 1 in your scripting/programming language.
SELECT * FROM
Table
LIMIT random_number_here,1The
RAND()
value will be calculated for each row, so it's not very efficient for large data sets, theLIMIT
clause doesn't change that. The usual way to work around this is to compute a random number in advance and then retrieve the row corresponding to it based on some pregenerated indexed column.Here's one detailed explanation:
http://jan.kneschke.de/projects/mysql/order-by-rand/
You're right, it will apply the ORDER BY after reducing the number of rows with WHERE, GROUP BY, and HAVING. But it will apply ORDER BY before LIMIT.
So if you filter the number of rows down sufficiently, then yes, the ORDER BY RAND() may achieve what you want without a great performance impact. There's a legitimate benefit to code that is simple and easily readable.
The trouble comes when you think your query should reduce the rows to something small, but over time as your data grows, the number of rows it needs to sort becomes large again. Since your query then does LIMIT 10 on the sorted result hides the fact that you're performing ORDER BY RAND() on 500k rows. You just see performance mysteriously getting worse.
I have written about alternative methods for picking a random row in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming, or in other answers here on Stack Overflow: