Quick selection of a random row from a large table

Posted 2019-01-07 00:48

Question:

What is a fast way to select a random row from a large mysql table?

I'm working in php, but I'm interested in any solution even if it's in another language.

Answer 1:

Grab all the IDs, pick a random one from the list, and retrieve the full row.

If you know the IDs are sequential without holes, you can just grab the max and calculate a random ID.

If there are holes here and there but mostly sequential values, and you don't mind slightly skewed randomness, grab the max value, calculate an ID, and select the first row with an ID equal to or above the one you calculated. The skew arises because an ID that follows a hole has a higher chance of being picked than one that directly follows another ID.

If you order by random, you're going to have a terrible table scan on your hands, and the word "quick" doesn't apply to such a solution.

Don't do that, and don't order by a GUID either; it has the same problem.
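
To make the skew concrete, here is a small Python sketch (not part of the original answer; the ID list is made up for illustration). It enumerates every equally likely random value for a table with one hole and counts which ID the "first row with an ID equal to or above" rule would return.

```python
from bisect import bisect_left
from collections import Counter


def first_id_at_or_above(sorted_ids, r):
    """Mimic: SELECT id FROM t WHERE id >= r ORDER BY id LIMIT 1."""
    return sorted_ids[bisect_left(sorted_ids, r)]


# Hypothetical table: IDs 4..9 were deleted, leaving a hole before 10.
ids = [1, 2, 3, 10]

# Try every possible r in 1..MAX(id) once and tally which row wins.
hits = Counter(first_id_at_or_above(ids, r) for r in range(1, max(ids) + 1))

for row_id in ids:
    print(row_id, hits[row_id] / max(ids))
```

In this made-up table the row after the hole absorbs every random value that falls inside the hole, so it is picked far more often than its neighbours. The larger the holes, the stronger the bias.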



Answer 2:

I knew there had to be a fast way to do it in a single query, and here it is.

A fast way without any external code, kudos to

http://jan.kneschke.de/projects/mysql/order-by-rand/

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;


Answer 3:

MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)

This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
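
A minimal sketch of the random-column idea, assuming a table shaped roughly as described above. This is not MediaWiki's actual code: the table and column names are illustrative, SQLite stands in for MySQL so the example runs on its own, and the wrap-around step is one possible way to handle a random value that is larger than every stored value.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, page_random REAL)"
)
# The index is what lets the lookup avoid a full scan.
conn.execute("CREATE INDEX idx_page_random ON page (page_random)")

# Store a random number with each row when the row is created.
for title in ["Alpha", "Beta", "Gamma", "Delta"]:
    conn.execute(
        "INSERT INTO page (title, page_random) VALUES (?, ?)",
        (title, random.random()),
    )


def random_page(conn):
    """Return the title whose stored random value is the next one >= r."""
    r = random.random()
    row = conn.execute(
        "SELECT title FROM page WHERE page_random >= ? "
        "ORDER BY page_random LIMIT 1",
        (r,),
    ).fetchone()
    if row is None:
        # r was above every stored value: wrap around to the smallest one.
        row = conn.execute(
            "SELECT title FROM page ORDER BY page_random LIMIT 1"
        ).fetchone()
    return row[0] if row else None


print(random_page(conn))
```

Rows whose stored value follows a large gap in the random column are picked more often, which is the distribution problem mentioned above.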



Answer 4:

Here's a solution that runs fairly quickly, and it gets a better random distribution without depending on id values being contiguous or starting at 1.

-- FLOOR keeps the offset within 0 .. COUNT(*) - 1
SET @r := (SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM mytable)));
SET @sql := CONCAT('SELECT * FROM mytable LIMIT ', @r, ', 1');
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
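
The same count-then-offset idea can be driven from application code. The sketch below is only an illustration: it uses Python with an in-memory SQLite table in place of MySQL, and the sample rows are made up. The offset is drawn from 0 to COUNT(*) - 1 so it never points past the last row.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
# IDs with gaps that do not start at 1, to show they do not matter here.
conn.executemany(
    "INSERT INTO mytable (id, name) VALUES (?, ?)",
    [(5, "a"), (9, "b"), (42, "c")],
)


def random_row(conn):
    """Pick a uniformly random row by skipping a random number of rows."""
    (count,) = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()
    if count == 0:
        return None
    offset = random.randrange(count)  # 0 .. count - 1
    return conn.execute(
        "SELECT id, name FROM mytable ORDER BY id LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()


print(random_row(conn))
```

Every row is equally likely, but the database still has to step over `offset` rows, so the cost grows with the size of the table.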


Answer 5:

Maybe you could do something like:

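-- Pick the id once; RAND() inside the WHERE clause would be re-evaluated for every row
SET @id := (SELECT FLOOR(1 + RAND() * COUNT(*)) FROM `table`);
SELECT * FROM `table` WHERE id = @id;

This is assuming your ID numbers start at 1 and are all sequential with no gaps.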



Answer 6:

Add a column containing a calculated random value to each row, and use that in the ordering clause, limiting to one result upon selection. With an index on that column, this works out faster than the table scan that ORDER BY RAND() causes.

Update: You still need to calculate some random value prior to issuing the SELECT statement upon retrieval, of course, e.g.

SELECT * FROM `foo` WHERE `foo_rand` >= {some random value} ORDER BY `foo_rand` LIMIT 1


Answer 7:

An easy but slow way (good for smallish tables) would be:

SELECT * FROM `table` ORDER BY RAND() LIMIT 1;


Answer 8:

In pseudo code:

sql "select id from table"
store result in list
n = random(size of list)
sql "select * from table where id=" + list[n]

This assumes that id is a unique (primary) key.
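
The pseudocode above could look like this in runnable form. It is a sketch under assumptions: Python with an in-memory SQLite table instead of MySQL, a hypothetical `items` table, and parameter binding in place of string concatenation.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(1, "one"), (4, "four"), (7, "seven")],
)


def random_row_via_id_list(conn):
    """Load every id, choose one at random, then fetch that full row."""
    ids = [row[0] for row in conn.execute("SELECT id FROM items")]
    if not ids:
        return None
    chosen = random.choice(ids)
    return conn.execute(
        "SELECT id, name FROM items WHERE id = ?", (chosen,)
    ).fetchone()


print(random_row_via_id_list(conn))
```

Gaps in the IDs do not matter here, but the whole ID list is transferred on every call, which becomes expensive for very large tables.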



Answer 9:

There is another way to produce random rows using only a query and without order by rand(). It involves User Defined Variables. See how to produce random rows from a table



Answer 10:

To find random rows in a table, don't use ORDER BY RAND(), because it forces MySQL to do a full filesort and only then retrieve the number of rows given by LIMIT. To avoid this full filesort, use the RAND() function only in the WHERE clause. The query will then stop as soon as it reaches the required number of rows. See http://www.rndblog.com/how-to-select-random-rows-in-mysql/



Answer 11:

If you never delete rows from this table, the most efficient way is:

(if you already know the minimum id, you can leave it out of the first query)

SELECT MIN(id) AS minId, MAX(id) AS maxId FROM table WHERE 1

$randId = mt_rand((int)$row['minId'], (int)$row['maxId']);

SELECT id,name,... FROM table WHERE id=$randId LIMIT 1


Answer 12:

For selecting multiple random rows from a given table (say 'words'), our team came up with this beauty:

SELECT * FROM
`words` AS r1 JOIN 
(SELECT  MAX(`WordID`) as wid_c FROM `words`) as tmp1
WHERE r1.WordID >= (SELECT (RAND() * tmp1.wid_c) AS id) LIMIT n


Answer 13:

The classic "SELECT id FROM table ORDER BY RAND() LIMIT 1" is actually OK.

See the following excerpt from the MySQL manual:

If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result.



Answer 14:

With an ORDER BY you will do a full table scan. It's better to do a SELECT COUNT(*) and then fetch the row at a random row number between 0 and the number of the last row.



Answer 15:

Take a look at this link by Jan Kneschke or this SO answer as they both discuss the same question. The SO answer goes over various options also and has some good suggestions depending on your needs. Jan goes over all the various options and the performance characteristics of each. He ends up with the following for the most optimized method by which to do this within a MySQL select:

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;

HTH,

-Dipin



Answer 16:

I'm a bit new to SQL, but how about generating a random number in PHP and using

SELECT * FROM the_table WHERE primary_key >= $randNr

This doesn't solve the problem with holes in the table.

But here's a twist on lassevk's suggestion:

SELECT primary_key FROM the_table

Use mysql_num_rows() in PHP to create a random number based on the above result:

SELECT * FROM the_table WHERE primary_key = rand_number

On a side note, just how slow is SELECT * FROM the_table if you create a random number based on mysql_num_rows() and then move the data pointer to that point with mysql_data_seek()? Just how slow will this be on large tables with, say, a million rows?



Answer 17:

I ran into the problem where my IDs were not sequential. What I came up with is this.

SELECT * FROM products WHERE RAND()<=(5/(SELECT COUNT(*) FROM products)) LIMIT 1

Approximately 5 rows match the condition, but I limit the result to 1.
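
One thing to keep in mind with this filter is that it can occasionally match no rows at all. The Python sketch below is an added illustration, not part of the original answer. It assumes each row passes the `RAND() <= k / N` test independently, so the chance that nothing matches is `(1 - k/N)^N`, which approaches `e^-k` for large tables.

```python
import math


def prob_no_rows(n_rows, expected_hits):
    """Chance that WHERE RAND() <= expected_hits / n_rows matches nothing."""
    p = min(1.0, expected_hits / n_rows)  # per-row match probability
    return (1.0 - p) ** n_rows


# Compare the exact value with the e^-k approximation for a large table.
for k in (1, 5, 100):
    print(k, prob_no_rows(1_000_000, k), math.exp(-k))
```

With an expected 5 matches the query comes back empty a bit under 1% of the time, so callers may want to retry on an empty result or raise the numerator.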

If you want to add another WHERE clause it becomes a bit more interesting. Say you want to search for products on discount.

SELECT * FROM products WHERE RAND()<=(100/(SELECT COUNT(*) FROM products)) AND discount<.2 LIMIT 1

You have to make sure enough rows are returned, which is why I have it set to 100. Putting a WHERE discount<.2 clause in the subquery was 10x slower, so it's better to return more results and limit.



Answer 18:

I see a lot of solutions here. One or two seem OK, but the others have some constraints. The following solution will work in all situations:

select a.* from random_data a, (select max(id)*rand() randid  from random_data) b
     where a.id >= b.randid limit 1;

Here, id doesn't need to be sequential. It can be any primary key, unique, or auto-increment column. Please see the following: Fastest way to select a random row from a big MySQL table

Thanks Zillur - www.techinfobest.com



Answer 19:

Use the below query to get the random row

SELECT user_firstname ,
COUNT(DISTINCT usr_fk_id) cnt
FROM userdetails 
GROUP BY usr_fk_id 
ORDER BY cnt ASC  
LIMIT 1


Answer 20:

In my case my table has an id as primary key, auto-increment with no gaps, so I can use COUNT(*) or MAX(id) to get the number of rows.

I made this script to test the fastest operation:

logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();

The results are:

  • Count: 36.8418693542479 ms
  • Max: 0.241041183472 ms
  • Order: 0.216960906982 ms

Answer with the order method:

SELECT FLOOR(1 + RAND() * (
    SELECT id FROM tbl ORDER BY id DESC LIMIT 1
)) n FROM tbl LIMIT 1

...
SELECT * FROM tbl WHERE id = $result;


Answer 21:

I have used this and it did the job (the reference is from here):

SELECT * FROM myTable WHERE RAND()<(SELECT ((30/COUNT(*))*10) FROM myTable) ORDER BY RAND() LIMIT 30;


Answer 22:

Create a function to do this. It is most likely the best and fastest answer here!

Pros - Works even with gaps and is extremely fast.

<?php

$sqlConnect = mysqli_connect('localhost','username','password','database');

// Returns one random row of `yourtable` as an associative array, or null if the table is empty
function rando($max = 0){
   global $sqlConnect; // mysqli connection variable defined outside of the function

   if($max == 0){
      // First call: use the highest id as the upper bound for the random offset
      $query = mysqli_query($sqlConnect, "SELECT `id` FROM `yourtable` ORDER BY `id` DESC LIMIT 0,1");
      $fetched_data = mysqli_fetch_assoc($query);
      if(!$fetched_data){
         return null; // Empty table, nothing to pick
      }
      $max = (int)$fetched_data['id'];
   }

   $irand = rand(0, $max - 1); // Random offset to fetch
   $query = mysqli_query($sqlConnect, "SELECT * FROM `yourtable` ORDER BY `id` DESC LIMIT {$irand},1");

   if(mysqli_num_rows($query) > 0){
      return mysqli_fetch_assoc($query);
   }
   return rando($max); // Offset fell past the last row because of gaps, so start over
}

$your_data = rando(); // Returns the data for a random entry as an associative array
?>

Please keep in mind that this code has not been tested, but it is a working concept for returning random entries even with gaps, as long as the gaps are not large enough to cause a load-time issue.



Answer 23:

Quick and dirty method:

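SET @COUNTER = (SELECT COUNT(*) FROM your_table);
SET @OFFSET = CAST(FLOOR(RAND() * @COUNTER) AS UNSIGNED);

-- LIMIT/OFFSET do not accept expressions, so bind the offset in a prepared statement
PREPARE stmt FROM 'SELECT PrimaryKey FROM your_table LIMIT 1 OFFSET ?';
EXECUTE stmt USING @OFFSET;
DEALLOCATE PREPARE stmt;

The complexity of the first query is O(1) for MyISAM tables.

The second query involves a full table scan. Complexity = O(n)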

Dirty and quick method:

Keep a separate table for this purpose only. You should also insert the same rows to this table whenever inserting to the original table. Assumption: No DELETEs.

CREATE TABLE Aux(
  MyPK INT AUTO_INCREMENT PRIMARY KEY,
  PrimaryKey INT
);

SET @MaxPK = (SELECT MAX(MyPK) FROM Aux);
SET @RandPK = FLOOR(1 + RAND() * @MaxPK);
SET @PrimaryKey = (SELECT PrimaryKey FROM Aux WHERE MyPK = @RandPK);

If DELETEs are allowed,

SET @delta = FLOOR(@RandPK / 10);

SET @PrimaryKey = (SELECT PrimaryKey
                   FROM Aux
                   WHERE MyPK BETWEEN @RandPK - @delta AND @RandPK + @delta
                   LIMIT 1);

The overall complexity is O(1).



Answer 24:

SELECT DISTINCT * FROM yourTable WHERE 4 = 4 LIMIT 1;