Optimizing my mysql statement! - RAND() TOO SLOW

2019-01-19 14:39发布

So I have a table with over 80,000 records, this one is called system. I also have another table called follows.

I need my statement to randomly select records from the system table, where that id is not already listed within the follows table under the current userid.

So here is what I have:

    SELECT system.id, 
           system.username, 
           system.password, 
           system.followed, 
           system.isvalid, 
           follows.userid, 
           follows.systemid
      FROM system
  LEFT JOIN follows ON system.id = follows.systemid
                   AND follows.userid = 2 
      WHERE system.followed = 0 
        AND system.isvalid = 1
        AND follows.systemid IS NULL
   ORDER BY RAND()
      LIMIT 200

Now it wotks perfectly, except that it takes about a whole minute before it can even start processing the job at hand with the records it chosen. By this time the script usually times oout and nothing happens.

Can somebody show me how to rework this, so the same idea is done, but it is not using order by rand? This seems to slow things down a whole bunch.

Thanks!

标签: sql mysql random
6条回答
在下西门庆
2楼-- · 2019-01-19 14:59

I am not sure there is a simple solution to replace your query, here is an article on correcting this type of issue.

http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

查看更多
孤傲高冷的网名
3楼-- · 2019-01-19 15:03

There are two main reasons for the slowness :

  • SQL must first issue a random number for each of the rows
  • The rows must then be ordered on the basis of this number to select the top 200 ones

There is a trick to help this situation, it requires a bit of prep work and the way to implement it (and its relative interest) depends on your actual use case.

==> Introduce an extra column with a "random category" value to filter-out most rows

The idea is to have an integer-valued column with values randomly assigned, once at prep time, with a value between say 0 and 9 (or 1 and 25... whatever). This column then needs to be added to the index used in the query. Finaly, by modifying the query to include a filter on this column = a particular value (say 3), the number of rows which SQL needs to handle is then reduced by 10 (or 25, depending on the number of distinct values we have in the "random category".

Assuming this new column is called RandPreFilter, we could introduced an index like

CREATE [UNIQUE ?] INDEX  
ON system (id, RandPreFilter)

And alter the query as follows

SELECT system.id
     , system.username
     , system.password
     , system.followed
     , system.isvalid
     , follows.userid
     , follows.systemid
FROM system
LEFT JOIN follows ON system.id = follows.systemid
   AND follows.userid = 2 
WHERE system.followed=0 AND system.isvalid=1
   AND follows.systemid IS NULL

   AND RandPreFilter = 1 -- or other numbers, or possibly 
        -- FLOOR(1 + RAND() * 25)
ORDER BY RAND()
LIMIT 200
查看更多
时光不老,我们不散
4楼-- · 2019-01-19 15:09

You can generate some pseudo random value based on the ids and the current time:

ORDER BY 37*(UNIX_TIMESTAMP() ^ system.id) & 0xffff

will mix bites from the id, and then will take only the lowest 16.

查看更多
唯我独甜
5楼-- · 2019-01-19 15:11

Depending on how random your data needs to be it might be worth ordering the data and adding an extra "last used" datetime column and update this once you use the data. Then select the top n rows ordering by the last used field descending.

If you wrap this in a prepared statement you can select one (semi) random result at a time without worrying about the logic.

Alternatively give every row a sequential ID and generate the randomness in the code and pull back just the required rows. The problem is that the full recordset is being returned prior to being ordered.

查看更多
Fickle 薄情
6楼-- · 2019-01-19 15:12

The reason the query is slow is that the database needs to keep a representation of all the generated random values and their respective data before it can return even a single row from the database. What you can do is to limit the number of candidate rows to consider first by using WHERE RAND() < x, where you select x to be a number likely to return at least the number of samples you need. To get a true random sample you would then need to order by RAND again or do sampling on the returned dataset.

Using this approach allows the database to process the query in a streaming fashion without having to build a large intermediate representation of all data. The drawback is that you can never be 100% sure that you get the number of samples you need, so you might have to perform the query again until you do, live with a smaller sample set or incrementally add samples (making sure to avoid duplicates) until you have the number of samples you need.

If you don't require the query to return different results for each call you could also add a pre-generated random value column with an index and combine with the above technique. It would allow you to get any number of samples in a fair manner, even if you add or delete rows, but the same query on the same data would of course return the same result set.

查看更多
霸刀☆藐视天下
7楼-- · 2019-01-19 15:20

Perhaps a little late, but at least here is an extra solution for future consideration:

SELECT minSystem.id, 
    minSystem.username, 
    minSystem.password, 
    minSystem.followed, 
    minSystem.isvalid,
    randFollows.userid, 
    randFollows.systemid
FROM
(
    SELECT *
    FROM system
    WHERE system.followed = 0 AND system.isvalid = 1
) as minSystem
LEFT JOIN 
(
    SELECT * 
    FROM (
        SELECT *
        FROM follows
        WHERE follows.systemid IS NULL
    ) as minFollows
    WHERE rand() <= 200 * 1.5 / (SELECT count(*) FROM follows WHERE systemid IS NULL)
) as randFollows
ON minSystem.id = randFollows.systemid
LIMIT 200

First, we perform a selection on the system table to cut down the minSystem and minFollow temp table size. Then we select random rows from the minFollows table through calculated probability. By now we will have a fairly random randFollows table to LEFT JOIN with minSystem. Finally we do LIMIT 200.

If you are using MyISam, you can simply retrieve the table size. This eliminates the extra subquery to calculate the follows table size. Alternatively, you can also hardcode the denominator if your table size doesn't grow too fast (This requires more manual maintenance however).

For more thorough explaination, please checkout the solution I posted on: MySQL: Alternatives to ORDER BY RAND()

Hope this helps (or at least I hope you'll find this interesting)!

查看更多
登录 后发表回答